Supplementary Material II

Every cog and wheel: Unraveling biocomplexity at the genomic and phenotypic level in a population complex of Chinook salmon.

Shannon J. O’Leary, Tasha Q. Thompson, Mariah H. Meek

1 Genotyping

Reads were mapped to the Christensen 2018 chinook genome and SNPs were called using freebayes. Filtered data set was haplotyped using rad_haplotyper. Fst-outlier were identified using FDIST method implemented in Arlequin and bayescan.

## 
## FULL DATA SET: all SNP-containing loci/all individuals grouped by run type.
## /// GENIND OBJECT /////////
## 
##  // 386 individuals; 12,983 loci; 30,037 alleles; size: 52.2 Mb
## 
##  // Basic content
##    @tab:  386 x 30037 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 2-9)
##    @loc.fac: locus factor for the 30037 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 2-2)
##    @type:  codom
##    @call: .local(x = x, i = i, j = j, drop = drop)
## 
##  // Optional content
##    @pop: population of each individual (group size range: 21-270)
##    @strata: a data frame with 10 columns ( LIB, LIB_ID, SAMPLE_ID, YEAR, SOURCE, RUN, ... )

Number of samples per tributary and run type.

LOCATION Fall Late-Fall Winter Spring
COL 30 - - -
MIL 20 - - 16
DER 15 - - 27
BUT 21 - - 19
FRH 27 - - 7
NIM 30 - - -
MKH 28 - - -
STN 23 - - -
TOU 30 - - -
MRH 15 - - -
MER 31 - - -
USR - 21 26 -

For details on bioinformatics on genotyping the data set see Supplementary Material I

2 Assessment of population structure and differentiation

Population structure was explored using two methods, a clustering analysis based on genetic similarity and an assessment of population differentiation among individuals grouped a priori based on run type and tributary.

2.1 PCA

Reduce the data’s dimensionality using a principle component analysis and project genotypes into two dimensional space visualize sample’s distances.

# create matrix with centered/scaled allele frequencies
# will also replace missing data
X <- scaleGen(gen, NA.method = "mean")

# perform PCA/retain top 3 PCs
PCA <- dudi.pca(X, center = FALSE, scale = FALSE,
                scannf = FALSE, nf = 10)

# extract Eigenvalues of each PC & calculate % variance explained by each PC
eig <- eigenvalues(PCA)

# # plot %variance explained by of top 25 PCs
# plot.eigen.variance(eig)

# Individuals' contribution to PCs/calculate Loading per individual and PC
PC_ind <- PC.ind(PCA)

# join individuals' contribution to PC with strata
PC_ind <- left_join(PC_ind, strata)

PC1 (2.5% of variation) separates Winter run individuals as being clearly distinct. Spring individuals cluster directly adjacent to Fall. PC2 (1.1% of variation) has a tight cluster of Late-Fall and Fall run individuals with Spring run clustering in a more spread out fashion. Spring Feather River Hatchery individuals cluster on the outskirts of the Fall cluster with Fall Feather River wild individuals.

Principal Component Analysis. Color of each data point indicates tributary of that individual. Shape of data point indicates run type.

Principal Component Analysis. Color of each data point indicates tributary of that individual. Shape of data point indicates run type.

2.2 k-mean clustering

Individuals were clustered into K = 2 – 10 groups using k-means clustering based on the PCA-transformed genotype matrix (i.e. no assumptions regarding Hardy-Weinberg or linkage disequilibrium) followed by a discriminant analysis of principle components (DAPC) to determine the membership probabilities of each sample to each inferred cluster as implemented in adegenet (Jombart, Devillard, and Balloux 2010). During DAPC, the genotype matrix is first PCA-transformed, then samples are partitioned into a within- and between-group component to maximize discrimination between groups using linear discriminant analysis, and finally, the membership probability of each sample to each cluster is calculated. To ensure sufficient variance was retained to discriminate among groups but not overfit the data, the optimum number of principle components to retain was determined using a stratified cross-validation of DAPC.

Use k-means clustering to identify clusters based on genetic similarity (data transformed using PCA). For clustering it is appropriate to retain all PCs (all variance in the data set), as overfitting is not an issue.

Figure S1: AIC for data clustered into K = 1 - 40 groups using k-means clustering.

Figure S1: AIC for data clustered into K = 1 - 40 groups using k-means clustering.

AIC is minimized for K=4.

Perform stratified cross-validation of DAPC using a range of retained PCs, while keeping the number of discriminant functions fixed for K = 2 - 10 to determine the most appropriate number of principle components to retain sufficient variance to discriminate but not overfit the data.

## define minimum and maximum K to cluster for ----
min_K <- 2

max_K <- 10

# number of repetitions for x-validation
n_rep <- 30

# range of values to loop over
range <- min_K:max_K

## loop over values for K ----

# list for cross validation plots
xVal_plots <- list()

# data frame for optimum PCs
opt_PC <- data.frame(matrix(ncol = 4, nrow = 0)) %>%
    rename(k_clust = X1,
           optPC = X2,
           variance = X3,
           mean_opt_success = X4)

# list for membership plots
memb_plots <- list()


# run loop
for (k in range){

    # set value for K
    k_clust <- k

    # cluster
    grp <- find.clusters.genind(gen, n.pca = 500, n.clust = k_clust, method = "kmeans")

    # scale allele frq/missing data replaced with mean
    X <- scaleGen(gen, NA.method = "mean")

    xval <- xvalDapc(X, grp$grp,
                     n.pca.max = 400, training.set = 0.9,
                     result = "groupMean", center = TRUE, scale = FALSE,
                     n.pca = NULL, n.rep = n_rep, xval.plot = FALSE)

    # extract assignment success for each xVal step
    PCs <- data.frame(N_PCs = xval[["Cross-Validation Results"]]$n.pca,
                      PROP_SUCCESS = xval[["Cross-Validation Results"]]$success)

    # create crossvalidation plot
    xVal_plots[[k_clust]] <- ggplot(PCs, aes(x = N_PCs, y = PROP_SUCCESS)) +
      stat_density2d(aes(fill = ..density..^0.25),
                     geom = "tile", contour = FALSE, n = 200) +   
      geom_point() +
      scale_fill_continuous(low = "white", high = "dodgerblue4") +
      scale_y_continuous(limits = c(0, 1)) +
      labs(title = glue("Cross validation K = {k_clust}"),
           x = "retained PCs", y = "assignment success") +
      theme_standard +
      theme(legend.position = "none")

    # optimum number of PCs to retain
    retain <- as.numeric(xval$`Number of PCs Achieving Highest Mean Success`)

    variance <- round(xval[["DAPC"]]$var*100, digits = 2)

    opt <- PCs %>%
      filter(N_PCs == retain)

    df <- as.data.frame(x = k_clust) %>%
      mutate(optPC = retain,
             variance = variance,
             mean_opt_success = round(mean(opt$PROP_SUCCESS), digits = 2))

    opt_PC <- opt_PC %>%
        bind_rows(df)

    # extract membership assignment probabilities
    grp_membership <- as.data.frame(xval[["DAPC"]]$posterior) %>%
      rownames_to_column("LIB_ID") %>%
      gather(key = GRP, value = MEMBSHIP, 2:(k_clust+1)) %>%
      left_join(strata) %>%
      unite(RUN_LOC, RUN, LOCATION, sep = "_", remove = FALSE) %>%
      arrange(RUN_LOC, LIB_ID) %>%
      mutate(LIB_ID = factor(LIB_ID, levels = unique(LIB_ID)))

    path <- as.character(glue("results/runs_K{k}.grp.membership"))

    write_delim(grp_membership, path, delim = "\t")

    # create membership plot
    memb_plots[[k_clust]] <- ggplot(grp_membership,
                                    aes(x = LIB_ID, y = MEMBSHIP, fill = GRP, color = GRP)) +
      geom_bar(stat = "identity") +
      facet_grid(. ~ RUN_LOC, scales = "free", space = "free") +
      scale_fill_manual(values = col_nine) +
      scale_color_manual(values = col_nine) +
      labs(x = "INDV", y = "memb. prob") +
      theme_standard +
      theme(axis.text.x = element_blank(),
        strip.text.x = element_text(size = 6, color = "black"),
        strip.text.y = element_text(size = 6, color = "black"),
        legend.position = "none")

}

write_delim(opt_PC, "results/kmeans_all-loci.optPCs", delim = "\t")

Retaining too few PCs may result in important variance not being retained and therefore not informing the clustering analysis, while retaining too many PCs results in overfitting, i.e. assignment success decreases.

Proportion of test individuals correctly assigned to their cluster of origin based on number of retained PCs for K = 2 - 9.

Proportion of test individuals correctly assigned to their cluster of origin based on number of retained PCs for K = 2 - 9.

Table S1: The optimum number of principle components,%-variance retained and the mean optimum assignment succes for K = 2 -10 clusters.

k_clust optPC variance mean_opt_success
2 50 63.49 1.00
3 50 21.77 1.00
4 100 36.98 0.99
5 100 36.98 0.97
7 50 21.77 0.97
8 20 11.62 0.96
6 50 21.77 0.95
9 40 18.50 0.87
10 80 31.08 0.82

Genetic data is scaled and centered, then transformed using a PCA. Retained PCs are then transmitted to a linear discriminant analysis.

Calculate membership probability of each individual to per cluster.

Membership probability of each individual to clusters identified using k-means hierarchical clustering (equiavalent to Figure 2 in manuscript).

Membership probability of each individual to clusters identified using k-means hierarchical clustering (equiavalent to Figure 2 in manuscript).

Tributaries are grouped within runs by tributary from north to south as Fall_BUT, Fall_COL, Fall_DER, Fall_FRH, Fall_MER, Fall_MIL, Fall_MKH, Fall_MRH, Fall_NIM, Fall_STN, Fall_TOU, Late-Fall_USR, Spring_BUT, Spring_DER, Spring_FRH, Spring_MIL, and Winter_USR (individual panels).

2.3 F-statistics

Weir & Cockerham’s unbiased estimator of FST (Weir and Cockerham 1984) was used to calculate population differentiation among individuals grouped a priori by run type within tributary.

To test for genetic heterogeneity in the data set, global FST was calculated across all groups.

# group genetic data for comparison
setPop(gen) <- ~RUN_LOC

pop <- popNames(gen)

# calculate Fst
tidy <- tidy_genomic_data(data = gen, filename = NULL, parallel = 65)

fst <- assigner::fst_WC84(data = tidy,
                          pop.levels = pop,
                          holdout.samples = NULL,
                          pairwise = TRUE,
                          ci = TRUE,
                          iteration.ci = 1000,
                          quantiles.ci = c(0.025, 0.975),
                          digits = 9,
                          verbose = TRUE)

# write results
write_delim(fst$fst.overall, "results/trib_allLoc.globalfst.ci", delim = "\t")

write_delim(fst[["fst.ranked"]], "results/trib_ranked.perlocus.fst", delim = "\t")

write_delim(fst[["fis.markers"]], "results/trib_perlocus.fis", delim = "\t")

write_delim(fst$pairwise.fst, "results/trib_allLoc.fst.ci", delim = "\t")

write_delim(fst$sigma.loc, "results/trib_perlocus.sigma", delim = "\t")

write_delim(fst$fis.markers, "trib_perlocus.fis", delim = "\t")

write_delim(fst$fis.overall, "results/trib_allLoc.fis", delim = "\t")

Significant heterogeneity was detected among individuals grouped by run/tributary. Significance was determined using 95% confidence intervals around each estimate generated by re-sampling loci 1,000 times using assigner (Gosselin, Anderson, and Bradbury 2016).

Global Fst and 95% CI for individuals grouped by runtype.

FST N_MARKERS CI_LOW CI_HIGH
0.0318754 12886 0.0309843 0.032883

Pairwise FST was calculated as a post hoc test for pairwise differences among groups. Significance was determined using 95% confidence intervals around each estimate generated by re-sampling loci 1,000 times.

Table S2: Pairwise Fst among runs basins.

POP1 POP2 COMP N_MARKERS FST CI_LOW CI_HIGH
F_MRH F_DER F-F 9448 0.018 0.017 0.019
F_MRH F_MIL F-F 9594 0.016 0.015 0.017
F_MRH F_BUT F-F 9600 0.015 0.013 0.016
F_DER F_MIL F-F 9678 0.014 0.013 0.016
F_MRH F_STN F-F 9688 0.013 0.012 0.014
F_DER F_STN F-F 9784 0.012 0.011 0.013
F_DER F_BUT F-F 9667 0.012 0.010 0.012
F_MIL F_BUT F-F 9798 0.011 0.010 0.012
F_MIL F_STN F-F 9896 0.011 0.010 0.012
F_BUT F_STN F-F 9848 0.008 0.007 0.009
F_FRH F_MRH F-F 10026 0.008 0.007 0.009
F_MRH F_COL F-F 10129 0.008 0.007 0.008
F_MRH F_TOU F-F 9863 0.007 0.006 0.008
F_FRH F_DER F-F 10113 0.007 0.006 0.008
F_DER F_TOU F-F 9939 0.007 0.006 0.007
F_MKH F_MRH F-F 9852 0.006 0.006 0.007
F_MIL F_TOU F-F 10068 0.006 0.005 0.007
F_FRH F_BUT F-F 10186 0.006 0.005 0.007
F_FRH F_MIL F-F 10227 0.006 0.005 0.007
F_COL F_STN F-F 10328 0.006 0.005 0.007
F_MKH F_DER F-F 9934 0.006 0.005 0.007
F_FRH F_STN F-F 10250 0.006 0.005 0.007
F_MKH F_MIL F-F 10034 0.006 0.005 0.007
F_MRH F_NIM F-F 9948 0.006 0.005 0.006
F_DER F_NIM F-F 10047 0.006 0.005 0.006
F_DER F_COL F-F 10130 0.005 0.004 0.006
F_BUT F_COL F-F 10250 0.005 0.004 0.006
F_MRH F_MER F-F 9956 0.005 0.004 0.006
F_MIL F_NIM F-F 10146 0.005 0.004 0.005
F_DER F_MER F-F 10048 0.005 0.004 0.005
F_MIL F_COL F-F 10249 0.005 0.004 0.005
F_MIL F_MER F-F 10151 0.004 0.004 0.005
F_BUT F_TOU F-F 10024 0.004 0.003 0.005
F_MKH F_BUT F-F 10005 0.004 0.003 0.005
F_TOU F_COL F-F 10432 0.004 0.003 0.004
F_BUT F_NIM F-F 10106 0.003 0.003 0.004
F_MKH F_STN F-F 10061 0.003 0.002 0.004
F_TOU F_STN F-F 10075 0.003 0.002 0.004
F_NIM F_COL F-F 10479 0.003 0.002 0.004
F_MKH F_COL F-F 10426 0.003 0.002 0.003
F_FRH F_TOU F-F 10387 0.003 0.002 0.003
F_BUT F_MER F-F 10116 0.002 0.002 0.003
F_NIM F_STN F-F 10144 0.002 0.001 0.003
F_MKH F_FRH F-F 10358 0.002 0.001 0.003
F_MER F_STN F-F 10173 0.002 0.001 0.003
F_MER F_COL F-F 10495 0.002 0.001 0.002
F_FRH F_NIM F-F 10433 0.002 0.001 0.002
F_FRH F_COL F-F 10545 0.001 0.001 0.002
F_FRH F_MER F-F 10417 0.001 0.000 0.002
F_MKH F_TOU F-F 10186 0.000 0.000 0.001
F_MKH F_MER F-F 10252 0.000 0.000 0.000
F_MKH F_NIM F-F 10259 0.000 0.000 0.000
F_TOU F_MER F-F 10282 0.000 0.000 0.000
F_TOU F_NIM F-F 10306 0.000 0.000 0.000
F_MER F_NIM F-F 10336 0.000 0.000 0.000
S_BUT S_FRH S-S 8863 0.050 0.048 0.052
S_MIL S_FRH S-S 9009 0.047 0.045 0.049
S_MIL S_BUT S-S 9226 0.036 0.034 0.038
S_DER S_FRH S-S 9710 0.027 0.025 0.028
S_DER S_BUT S-S 9764 0.023 0.022 0.024
S_DER S_MIL S-S 9838 0.011 0.010 0.012
F_MRH L_USR F-L 9578 0.019 0.018 0.020
F_DER L_USR F-L 9665 0.017 0.016 0.019
F_MIL L_USR F-L 9831 0.016 0.015 0.017
F_BUT L_USR F-L 9749 0.015 0.014 0.016
F_STN L_USR F-L 9888 0.015 0.014 0.016
F_FRH L_USR F-L 10243 0.012 0.011 0.013
F_MKH L_USR F-L 10074 0.012 0.011 0.013
F_TOU L_USR F-L 10052 0.011 0.010 0.012
F_COL L_USR F-L 10252 0.010 0.009 0.012
F_NIM L_USR F-L 10151 0.010 0.009 0.011
F_MER L_USR F-L 10159 0.009 0.008 0.009
F_STN W_USR F-W 9545 0.162 0.157 0.167
F_BUT W_USR F-W 9370 0.156 0.151 0.162
F_MKH W_USR F-W 9744 0.156 0.151 0.161
F_TOU W_USR F-W 9800 0.156 0.151 0.161
F_NIM W_USR F-W 9920 0.156 0.151 0.161
F_MER W_USR F-W 9914 0.152 0.147 0.158
F_MRH W_USR F-W 9052 0.152 0.147 0.157
F_MIL W_USR F-W 9385 0.152 0.146 0.157
F_DER W_USR F-W 9191 0.149 0.144 0.154
F_FRH W_USR F-W 9946 0.146 0.141 0.151
F_COL W_USR F-W 9934 0.141 0.136 0.146
F_MRH S_BUT F-S 9420 0.051 0.049 0.053
F_STN S_BUT F-S 9874 0.049 0.047 0.052
F_DER S_BUT F-S 9561 0.049 0.047 0.051
F_BUT S_BUT F-S 9720 0.047 0.045 0.050
F_MIL S_BUT F-S 9721 0.047 0.045 0.049
F_MKH S_BUT F-S 10075 0.044 0.042 0.046
F_TOU S_BUT F-S 10045 0.044 0.042 0.046
F_MRH S_MIL F-S 9570 0.044 0.042 0.046
F_NIM S_BUT F-S 10190 0.043 0.041 0.045
F_MER S_BUT F-S 10185 0.040 0.039 0.042
F_MRH S_FRH F-S 8928 0.039 0.037 0.041
F_DER S_MIL F-S 9650 0.038 0.036 0.040
F_FRH S_BUT F-S 10190 0.037 0.035 0.039
F_STN S_MIL F-S 9913 0.036 0.034 0.038
F_COL S_BUT F-S 10263 0.036 0.034 0.038
F_MIL S_MIL F-S 9805 0.036 0.034 0.038
F_DER S_FRH F-S 9050 0.036 0.034 0.038
F_BUT S_MIL F-S 9799 0.036 0.034 0.038
F_MKH S_MIL F-S 10111 0.031 0.029 0.032
F_TOU S_MIL F-S 10112 0.031 0.029 0.032
F_STN S_DER F-S 10413 0.030 0.029 0.032
F_MRH S_DER F-S 10115 0.030 0.028 0.032
F_MIL S_FRH F-S 9266 0.030 0.028 0.031
F_BUT S_FRH F-S 9264 0.029 0.028 0.031
F_BUT S_DER F-S 10302 0.029 0.027 0.031
F_NIM S_MIL F-S 10235 0.029 0.027 0.030
F_DER S_DER F-S 10189 0.029 0.027 0.030
F_MKH S_DER F-S 10546 0.028 0.027 0.030
F_NIM S_DER F-S 10662 0.028 0.026 0.029
F_TOU S_DER F-S 10544 0.027 0.026 0.029
F_MIL S_DER F-S 10317 0.027 0.025 0.029
F_MER S_MIL F-S 10219 0.027 0.025 0.028
F_STN S_FRH F-S 9382 0.027 0.025 0.028
F_MER S_DER F-S 10620 0.025 0.024 0.027
F_FRH S_MIL F-S 10235 0.024 0.023 0.026
F_COL S_MIL F-S 10297 0.023 0.021 0.024
F_FRH S_DER F-S 10637 0.021 0.019 0.022
F_COL S_DER F-S 10702 0.021 0.019 0.023
F_MKH S_FRH F-S 9627 0.018 0.017 0.020
F_TOU S_FRH F-S 9654 0.018 0.017 0.019
F_NIM S_FRH F-S 9790 0.017 0.016 0.018
F_COL S_FRH F-S 9885 0.014 0.013 0.016
F_MER S_FRH F-S 9762 0.014 0.013 0.015
F_FRH S_FRH F-S 9754 0.014 0.013 0.015
W_USR L_USR W-L 9355 0.160 0.155 0.165
S_BUT L_USR S-L 9685 0.051 0.049 0.054
S_MIL L_USR S-L 9808 0.038 0.036 0.041
S_DER L_USR S-L 10307 0.032 0.030 0.035
S_FRH L_USR S-L 9252 0.031 0.029 0.032
W_USR S_BUT W-S 8812 0.159 0.154 0.164
W_USR S_MIL W-S 8985 0.143 0.139 0.148
W_USR S_DER W-S 9717 0.141 0.136 0.146
W_USR S_FRH W-S 8288 0.122 0.118 0.126

Figure S2: Pairwise Fst of individuals grouped by run/tributary. Fst-values denoted with * have lower CI > 0

Figure S2: Pairwise Fst of individuals grouped by run/tributary. Fst-values denoted with * have lower CI > 0

3 Assessment of genomic diversity

For all measures of genetic diversity comparisons were made for individuals grouped by run type within each tributary, with wild and hatchery individuals treated as separate groups even if they were sampled in the same tributary. Sample sizes among run/tributary groups are comparable with most in the range of 20 - 30, though only seven Feather River Spring individuals are present.

The following parameters are compared:

For each parameter, significant heterogeneity among groups was determined using a Friedman’s test, which implements a repeated block design (alleles observed per locus present in each sample). Next, significance of pairwise comparisons was tested using a Wilcoxon signed rank test; p-values were adjusted assuming a FDR of 0.05. The use of the signed rank test gives “direction” to the test of significance, due to the block design, i.e. pairwise comparisons show that loci will consistently be lower (higher) across individuals as opposed to a test that does not account for the fact that the same blocks/loci are being observed (e.g. a t-test).

For Tajima’s D, additionally we generated a genome-wide null distribution of Tajima’s \(D\) for a set of neutral loci assuming mutation drift-equilibrium reflecting the composition of the haplotyped empirical data set consisting of the same number of loci with same distribution of segregating sites by simulating loci under coalescence using MS (Hudson 2002) to compare to the empirical data set.

Results of tests of significance for comparisons are presented as a heatmap indicating the test statistic of pairwise comparisons and printed p-values, significantly different comparisons (P < 0.05) is presented. Finally, the distribution of values across loci for individuals grouped by run or run/tributary are presented as boxplots and a table with summary statistics.

To compare whether identified loci are randomly distributed across chromosomes, null distributions were generated by shuffling chromosome designations across loci 1,000 times to determine whether the observed values fall outside the null distribution.

3.1 Calculate measures of genetic diversity based on allele frequency and heterozygosity

# define groups to analyze ====
setPop(gen) <- ~RUN_LOC

gen_grp <- seppop(gen)

gen_grp[["ALL"]] <- gen

# calculate diversity stats ====
loc_stats <- list()

for (p in names(gen_grp)) {

locA <- locus_table(gen_grp[[p]], index = "shannon") %>%
  as.data.frame() %>%
  rownames_to_column("LOCUS")

locB <- locus_table(gen_grp[[p]], index = "simpson") %>%
  as.data.frame() %>%
  rownames_to_column("LOCUS")

temp <- left_join(locA, locB)

locC <- locus_table(gen_grp[[p]], index = "invsimpson") %>%
  as.data.frame() %>%
  rownames_to_column("LOCUS")

loc_stats[[p]] <- left_join(temp, locC)

}

loc_stats <- ldply(loc_stats, data.frame) %>%
  select(-Hexp) %>%
  rename(GRP = `.id`,
         SIMPSON_IDX = `X1.D`,
         N_ALLELES = allele,
         SHANNON_IDX = H,
         STODD_TAYLOR_IDX = G,
         EVENNESS = Evenness)

# calculate genetic diversity stats (heterozygosity-based) ====
loc_stats_2 <- list()

for (p in names(gen_grp)) {

dat <- hierfstat:::.genind2hierfstat(gen_grp[[p]])
stats <- basic.stats(dat)

loc_stats_2[[p]] <- stats$perloc %>%
  rownames_to_column("LOCUS")

}

# combine into single data frame ====
loc_stats_2 <- ldply(loc_stats_2, data.frame) %>%
  rename(GRP = `.id`)

loc_stats <- left_join(loc_stats, loc_stats_2) %>%
  select(GRP, LOCUS, N_ALLELES, EVENNESS, Ho, Hs, Ht, Fis, SHANNON_IDX, SIMPSON_IDX, STODD_TAYLOR_IDX) %>%
  filter(LOCUS != "mean")

loc_stats[is.na(loc_stats)] <- NA

write_delim(loc_stats, "results/gendiv.locstats", delim = "\t")

3.2 Observed heterozygosity

The observed heterozygosity (Ho) is measured as the proportion of heterozygote genotypes per locus (Nei 1987).

Compare distribution of observed heterozygosity within and among runs

Test of significant differences among runs using Friedman’s test.

## 
##  Friedman rank sum test
## 
## data:  Ho and GRP and LOCUS
## Friedman chi-squared = 3891.6, df = 16, p-value < 0.00000000000000022

Test for significant pairwise differences between tributaries within runs using Wilcoxon signed rank test.

# remove loci with Na values
rm <- het %>%
  filter(is.na(Ho))

het <- het %>%
  filter(!LOCUS %in% rm$LOCUS)

# groups to compare
comp <- as.character(unique(het$GRP))

# pairs of comparisons
pairs <- expand.grid(comp, comp) %>%
  filter(!Var1 == Var2) %>%
  rownames_to_column("PAIR") %>%
  split(.$PAIR) %>%
  purrr::map(function(x){
    x %>%
      select(-PAIR) %>%
      gather(key = temp, value = GRP, 1:2) %>%
      select(-temp)
  })

# empty data frame for results
results <- setNames(data.frame(matrix(ncol = 4, nrow = 0)), 
                    c("pop1", "pop2", "stat", "p.value")) %>%
  mutate(pop1 = as.character(pop1),
         pop2 = as.character(pop2),
         stat = as.numeric(stat),
         p.value = as.numeric(p.value))

n <- as.numeric(length(pairs))

# loop over pairs
for(i in 1:length(pairs)){

  p <- i

  pair <- pairs[[p]]$GRP

  temp <- het %>%
    filter(GRP %in% pair) %>%
    mutate(GRP = ordered(GRP, levels = pair),
         LOCUS = as.factor(LOCUS)) %>%
  droplevels()

  wilcox <- wilcoxsign_test(Ho ~ GRP | LOCUS,
                data = temp,
                zero.method = "Pratt")

  df <- data.frame("pop1" = pair[1],
                   "pop2" = pair[2],
                   "stat" = as.numeric(wilcox@statistic@teststatistic),
                   "p-value" = as.numeric(pvalue(wilcox)))

  results <- bind_rows(results, df)

}

results <- results %>%
  mutate(pop1 = ordered(pop1, levels = trib_runs),
         pop2 = ordered(pop2, levels = trib_runs))

write_delim(results, "results/obshet_run_trib.wilcox", delim = "\t")

Identify pairwise significantly different distribution between run/trib groups.

Results pairwise comparisons of observed heterozygosity among tributaries within runs using Wilcoxon signed rank test.

Results pairwise comparisons of observed heterozygosity among tributaries within runs using Wilcoxon signed rank test.

Compare distributions.

Distribution of observed heterozygosity of individuals grouped by tributary within each run.

Distribution of observed heterozygosity of individuals grouped by tributary within each run.

Significantly different pairwise comparisons of observedheterozygosity across loci for individuals grouped by tributaries withinruns.

pop1 pop2 stat p.value p_adj
F_FRH W_USR 39.1048 0.0000 0.0000
F_COL W_USR 38.6791 0.0000 0.0000
F_MER W_USR 35.3954 0.0000 0.0000
F_NIM W_USR 35.3437 0.0000 0.0000
S_DER W_USR 35.3418 0.0000 0.0000
F_MKH W_USR 35.1831 0.0000 0.0000
F_MIL W_USR 34.3626 0.0000 0.0000
F_STN W_USR 34.3477 0.0000 0.0000
F_TOU W_USR 34.3074 0.0000 0.0000
L_USR W_USR 33.1392 0.0000 0.0000
F_BUT W_USR 32.4029 0.0000 0.0000
F_DER W_USR 31.6226 0.0000 0.0000
S_MIL W_USR 29.6530 0.0000 0.0000
F_MRH W_USR 29.3136 0.0000 0.0000
S_BUT W_USR 26.4736 0.0000 0.0000
S_FRH W_USR 23.0823 0.0000 0.0000
F_FRH S_BUT 16.1773 0.0000 0.0000
F_COL S_BUT 14.3671 0.0000 0.0000
F_MIL S_BUT 11.4354 0.0000 0.0000
F_MKH S_BUT 11.3978 0.0000 0.0000
F_FRH F_BUT 11.3174 0.0000 0.0000
F_NIM S_BUT 11.2935 0.0000 0.0000
F_FRH S_MIL 11.2336 0.0000 0.0000
F_FRH S_FRH 10.8857 0.0000 0.0000
F_FRH L_USR 10.6817 0.0000 0.0000
F_FRH F_MRH 10.6776 0.0000 0.0000
F_MER S_BUT 10.5184 0.0000 0.0000
F_FRH F_TOU 10.3985 0.0000 0.0000
F_FRH F_MER 10.0416 0.0000 0.0000
F_FRH F_DER 9.5463 0.0000 0.0000
F_TOU S_BUT 9.5314 0.0000 0.0000
F_COL F_BUT 9.2579 0.0000 0.0000
F_FRH S_DER 8.8955 0.0000 0.0000
F_FRH F_NIM 8.7359 0.0000 0.0000
F_COL L_USR 8.5759 0.0000 0.0000
F_FRH F_MKH 8.5154 0.0000 0.0000
F_DER S_BUT 8.4337 0.0000 0.0000
F_COL S_FRH 8.3184 0.0000 0.0000
F_COL S_MIL 8.0711 0.0000 0.0000
S_DER S_BUT 7.6913 0.0000 0.0000
F_STN S_BUT 7.6438 0.0000 0.0000
F_FRH F_STN 7.5029 0.0000 0.0000
F_COL S_DER 7.2397 0.0000 0.0000
F_COL F_MRH 7.1334 0.0000 0.0000
F_MIL F_BUT 7.0241 0.0000 0.0000
F_MKH S_FRH 6.8937 0.0000 0.0000
S_MIL S_BUT 6.7613 0.0000 0.0000
F_MRH S_BUT 6.7527 0.0000 0.0000
F_NIM F_BUT 6.4274 0.0000 0.0000
F_MKH S_MIL 6.2274 0.0000 0.0000
F_BUT S_BUT 6.0031 0.0000 0.0000
F_MKH F_BUT 6.0024 0.0000 0.0000
F_MIL L_USR 5.9055 0.0000 0.0000
F_NIM S_FRH 5.8448 0.0000 0.0000
F_NIM S_MIL 5.5633 0.0000 0.0000
F_COL F_DER 5.5048 0.0000 0.0000
F_COL F_STN 5.4600 0.0000 0.0000
L_USR S_BUT 5.2146 0.0000 0.0000
F_COL F_TOU 5.1889 0.0000 0.0000
F_NIM L_USR 5.1206 0.0000 0.0000
F_FRH F_MIL 5.0978 0.0000 0.0000
F_FRH F_COL 5.0077 0.0000 0.0000
F_MER S_MIL 4.9342 0.0000 0.0000
F_MER S_FRH 4.8220 0.0000 0.0000
F_NIM F_MRH 4.8102 0.0000 0.0000
F_MIL S_FRH 4.8086 0.0000 0.0000
F_MIL S_DER 4.7729 0.0000 0.0000
F_MKH L_USR 4.6296 0.0000 0.0000
F_TOU S_FRH 4.5094 0.0000 0.0000
F_MIL F_STN 4.3115 0.0000 0.0000
F_NIM S_DER 4.3009 0.0000 0.0000
F_COL F_MER 4.1906 0.0000 0.0000
F_MER F_BUT 4.1713 0.0000 0.0000
F_MKH F_MRH 4.0767 0.0000 0.0000
F_MKH F_DER 3.9318 0.0001 0.0002
F_TOU S_MIL 3.9263 0.0001 0.0002
F_STN S_FRH 3.9094 0.0001 0.0002
F_MIL S_MIL 3.8983 0.0001 0.0002
F_MER L_USR 3.8340 0.0001 0.0002
F_DER F_BUT 3.8071 0.0001 0.0002
F_MKH F_MER 3.7969 0.0001 0.0002
F_STN S_DER 3.6471 0.0003 0.0005
F_COL F_MIL 3.5339 0.0004 0.0007
F_MKH S_DER 3.4436 0.0006 0.0010
L_USR S_FRH 3.3424 0.0008 0.0013
F_NIM F_DER 3.2529 0.0011 0.0017
F_MIL F_MRH 3.2057 0.0013 0.0020
S_FRH S_BUT 3.2008 0.0014 0.0022
F_DER S_MIL 3.0998 0.0019 0.0029
F_MER F_MRH 2.9651 0.0030 0.0045
F_MKH F_TOU 2.9566 0.0031 0.0046
F_MER S_DER 2.9367 0.0033 0.0049
F_TOU F_BUT 2.9209 0.0035 0.0051
F_MIL F_TOU 2.7212 0.0065 0.0094
F_DER L_USR 2.6895 0.0072 0.0103
S_DER S_FRH 2.6374 0.0084 0.0119
F_STN S_MIL 2.6302 0.0085 0.0119
F_STN L_USR 2.6288 0.0086 0.0119
F_NIM F_MER 2.5304 0.0114 0.0157
F_MRH S_MIL 2.5200 0.0117 0.0159
F_NIM F_STN 2.4878 0.0129 0.0174
F_MKH F_NIM 2.4789 0.0132 0.0176
F_TOU F_MRH 2.3624 0.0182 0.0240
F_BUT S_FRH 2.2552 0.0241 0.0315
F_BUT L_USR 2.2400 0.0251 0.0325
F_STN F_MRH 2.2356 0.0254 0.0326
F_TOU L_USR 2.2095 0.0271 0.0344

Table S3: Mean, standard deviation, median, 25th, and 75thquartile of observed heterozygosity across loci for individuals groupedby run type & tributary.

GRP mean sd median Q1 Q3
F_COL 0.1713 0.1886 0.1000 0 0.3000
F_MIL 0.1699 0.1922 0.1000 0 0.2941
F_DER 0.1674 0.1939 0.0714 0 0.2857
F_BUT 0.1641 0.1884 0.1000 0 0.2778
F_FRH 0.1741 0.1908 0.1111 0 0.2963
F_NIM 0.1688 0.1874 0.1000 0 0.3000
F_MKH 0.1689 0.1884 0.0800 0 0.2917
F_STN 0.1673 0.1905 0.0909 0 0.2857
F_TOU 0.1659 0.1849 0.0833 0 0.2857
F_MRH 0.1658 0.1938 0.0769 0 0.2857
F_MER 0.1670 0.1846 0.1000 0 0.2903
L_USR 0.1654 0.1888 0.0952 0 0.2857
W_USR 0.1272 0.1840 0.0385 0 0.2308
S_MIL 0.1651 0.1932 0.0714 0 0.2857
S_DER 0.1628 0.1802 0.0870 0 0.2800
S_BUT 0.1587 0.1888 0.0588 0 0.2778
S_FRH 0.1695 0.2157 0.1429 0 0.2857

3.3 Expected heterozygosity

Expected heterozygosity Hs is calculated as the proportion of heterozygous genotypes expected under Hardy-Weinberg equilibrium (Nei 1987).

Compare distribution of gene diversity within and among runs

Test of significant differences among runs using Friedman’s test.

## 
##  Friedman rank sum test
## 
## data:  Hs and GRP and LOCUS
## Friedman chi-squared = 3421.1, df = 16, p-value < 0.00000000000000022

Test for significant pairwise differences between tributaries within runs using Wilcoxon signed rank test.

Results pairwise comparisons of expected heterozygosity among tributaries within runs using Wilcoxon signed rank test.

Results pairwise comparisons of expected heterozygosity among tributaries within runs using Wilcoxon signed rank test.

Compare distributions across groups.

Distribution of expected heterozygosity per locus for individuals grouped by tributary within each run.

Distribution of expected heterozygosity per locus for individuals grouped by tributary within each run.

Significance of pairwise comparisons of expected heterozygosityfor individuals grouped by tributaries within runs.

pop1 pop2 stat p.value p_adj
F_FRH W_USR 39.2049 0.0000 0.0000
S_DER W_USR 39.1176 0.0000 0.0000
F_COL W_USR 38.1169 0.0000 0.0000
F_MIL W_USR 36.5044 0.0000 0.0000
F_MER W_USR 35.9299 0.0000 0.0000
F_NIM W_USR 35.4636 0.0000 0.0000
F_TOU W_USR 35.2467 0.0000 0.0000
F_STN W_USR 35.0595 0.0000 0.0000
F_MKH W_USR 35.0166 0.0000 0.0000
L_USR W_USR 34.4389 0.0000 0.0000
F_BUT W_USR 33.8632 0.0000 0.0000
S_MIL W_USR 33.7289 0.0000 0.0000
F_DER W_USR 33.4003 0.0000 0.0000
F_MRH W_USR 32.8072 0.0000 0.0000
S_BUT W_USR 29.6358 0.0000 0.0000
S_FRH W_USR 25.8343 0.0000 0.0000
F_FRH S_BUT 10.7653 0.0000 0.0000
F_MIL S_BUT 9.3655 0.0000 0.0000
S_DER S_BUT 9.2576 0.0000 0.0000
F_COL S_BUT 9.1696 0.0000 0.0000
F_FRH F_MER 8.8120 0.0000 0.0000
F_FRH F_NIM 8.5196 0.0000 0.0000
F_FRH F_MKH 8.2886 0.0000 0.0000
F_MIL F_BUT 8.0117 0.0000 0.0000
F_FRH F_BUT 7.8149 0.0000 0.0000
S_MIL S_BUT 7.6811 0.0000 0.0000
F_FRH L_USR 7.6379 0.0000 0.0000
F_MIL L_USR 7.5210 0.0000 0.0000
F_FRH F_TOU 7.4032 0.0000 0.0000
F_MIL F_STN 7.1562 0.0000 0.0000
F_NIM S_BUT 7.0782 0.0000 0.0000
F_MER S_BUT 6.8788 0.0000 0.0000
F_MRH S_BUT 6.6184 0.0000 0.0000
F_MKH S_BUT 6.4468 0.0000 0.0000
F_DER S_BUT 6.3022 0.0000 0.0000
F_TOU S_BUT 6.2453 0.0000 0.0000
F_FRH F_DER 6.2127 0.0000 0.0000
F_FRH F_STN 6.0891 0.0000 0.0000
F_FRH S_FRH 5.9649 0.0000 0.0000
F_COL L_USR 5.6659 0.0000 0.0000
F_FRH S_MIL 5.4279 0.0000 0.0000
F_FRH F_COL 5.3574 0.0000 0.0000
F_MIL F_TOU 5.0770 0.0000 0.0000
F_COL F_BUT 4.7236 0.0000 0.0000
F_MIL F_MKH 4.7030 0.0000 0.0000
F_MIL F_MER 4.5026 0.0000 0.0000
S_DER F_TOU 4.4197 0.0000 0.0000
F_MIL F_NIM 4.3035 0.0000 0.0000
S_DER S_FRH 4.2387 0.0000 0.0000
F_BUT S_BUT 4.2346 0.0000 0.0000
F_FRH F_MRH 4.1109 0.0000 0.0000
S_DER F_MKH 4.0649 0.0000 0.0000
F_MIL S_FRH 3.8499 0.0001 0.0003
F_STN S_BUT 3.7820 0.0002 0.0005
F_MRH F_BUT 3.7347 0.0002 0.0005
S_DER F_MER 3.7025 0.0002 0.0005
F_COL S_FRH 3.6908 0.0002 0.0005
F_DER L_USR 3.5925 0.0003 0.0007
S_DER F_BUT 3.5784 0.0003 0.0007
S_DER F_NIM 3.4957 0.0005 0.0011
F_COL F_STN 3.4423 0.0006 0.0013
F_DER F_BUT 3.3728 0.0007 0.0015
F_MRH F_DER 3.2570 0.0011 0.0024
F_MRH L_USR 3.1089 0.0019 0.0040
S_DER L_USR 2.9818 0.0029 0.0061
L_USR S_BUT 2.8755 0.0040 0.0082
F_MKH F_MER 2.7790 0.0055 0.0112
F_MKH F_NIM 2.7182 0.0066 0.0132
S_MIL F_BUT 2.7065 0.0068 0.0134
F_BUT L_USR 2.6942 0.0071 0.0136
F_MIL F_DER 2.6926 0.0071 0.0136
F_NIM L_USR 2.6676 0.0076 0.0144
S_DER F_DER 2.6247 0.0087 0.0162
S_MIL L_USR 2.5775 0.0100 0.0184
F_MKH L_USR 2.5647 0.0103 0.0187
F_MER L_USR 2.5351 0.0112 0.0200
F_COL F_MER 2.4938 0.0126 0.0221
S_DER F_STN 2.4919 0.0127 0.0221
F_NIM F_BUT 2.4844 0.0130 0.0224
F_MER F_BUT 2.4674 0.0136 0.0231
S_FRH S_BUT 2.3866 0.0170 0.0285
F_NIM S_FRH 2.2779 0.0227 0.0376
F_MKH F_BUT 2.2373 0.0253 0.0415
S_DER S_MIL 2.1591 0.0308 0.0499

Table S4: Mean, standard deviation, median, 25th, and 75thquartile of observed heterozygosity per locus for individuals grouped byrun type & tributary.

GRP mean sd median Q1 Q3
F_COL 0.1693 0.1798 0.0966 0 0.3034
F_MIL 0.1714 0.1845 0.1078 0 0.3125
F_DER 0.1685 0.1850 0.0769 0 0.3022
F_BUT 0.1662 0.1827 0.0974 0 0.2974
F_FRH 0.1718 0.1810 0.1068 0 0.3063
F_NIM 0.1678 0.1804 0.0966 0 0.3034
F_MKH 0.1680 0.1810 0.1032 0 0.3003
F_STN 0.1670 0.1827 0.0929 0 0.3030
F_TOU 0.1676 0.1807 0.0998 0 0.3014
F_MRH 0.1695 0.1870 0.0833 0 0.3077
F_MER 0.1676 0.1799 0.0966 0 0.2960
L_USR 0.1668 0.1831 0.0929 0 0.3000
W_USR 0.1285 0.1789 0.0385 0 0.2369
S_MIL 0.1695 0.1877 0.0769 0 0.3125
S_DER 0.1699 0.1808 0.1068 0 0.3030
S_BUT 0.1642 0.1874 0.1023 0 0.3070
S_FRH 0.1695 0.1983 0.1429 0 0.3571

3.4 Inbreeding coefficient (Fis)

The inbreeding coefficient Fis is calculated as 1-Ho/He (Weir and Cockerham 1984) and indicates heterozygote deficiencies compared to expectations under Hardy-Weinberg Equilibrium, i.e. it is a measure of deviation from panmixia at local scales.

Compare distribution of heterozygosity deficiency (Fis) within and among runs

Test of significant differences among runs using Friedman’s test.

## 
##  Friedman rank sum test
## 
## data:  Fis and GRP and LOCUS
## Friedman chi-squared = 382.82, df = 16, p-value < 0.00000000000000022

Test for significant pairwise differences between tributaries within runs using Wilcoxon signed rank test.

Results pairwise comparisons of Fis among tributaries within runs using Wilcoxon signed rank test.

Results pairwise comparisons of Fis among tributaries within runs using Wilcoxon signed rank test.

Compare distributions across groups.

Distribution of Fis per locus for individuals grouped by tributary within each run.

Distribution of Fis per locus for individuals grouped by tributary within each run.

Significance of pairwise comparisons of Fis for individualsgrouped by tributaries within runs.

pop1 pop2 stat p.value p_adj
S_DER F_FRH 12.9986 0.0000 0.0000
S_DER F_COL 12.3879 0.0000 0.0000
S_DER F_MKH 11.6782 0.0000 0.0000
S_DER F_NIM 11.2995 0.0000 0.0000
S_DER S_FRH 10.4531 0.0000 0.0000
S_DER F_STN 9.7002 0.0000 0.0000
S_BUT F_FRH 9.3555 0.0000 0.0000
S_DER F_DER 8.9582 0.0000 0.0000
S_DER F_MER 8.9367 0.0000 0.0000
S_BUT F_COL 8.8651 0.0000 0.0000
S_DER F_MIL 8.5151 0.0000 0.0000
S_BUT S_FRH 8.5124 0.0000 0.0000
S_DER L_USR 8.0760 0.0000 0.0000
S_BUT F_NIM 7.9226 0.0000 0.0000
S_BUT F_MKH 7.9064 0.0000 0.0000
S_DER F_TOU 7.1567 0.0000 0.0000
W_USR S_FRH 7.0839 0.0000 0.0000
W_USR F_FRH 6.8825 0.0000 0.0000
S_BUT F_STN 6.8440 0.0000 0.0000
S_MIL F_FRH 6.7485 0.0000 0.0000
S_DER F_BUT 6.5330 0.0000 0.0000
W_USR F_COL 6.5018 0.0000 0.0000
S_BUT F_DER 6.2774 0.0000 0.0000
F_TOU F_FRH 6.2393 0.0000 0.0000
S_MIL S_FRH 6.1885 0.0000 0.0000
S_MIL F_COL 6.1795 0.0000 0.0000
F_TOU S_FRH 5.9771 0.0000 0.0000
S_DER W_USR 5.9011 0.0000 0.0000
S_BUT F_MIL 5.8820 0.0000 0.0000
F_TOU F_COL 5.8464 0.0000 0.0000
S_BUT F_MER 5.8370 0.0000 0.0000
S_DER F_MRH 5.7265 0.0000 0.0000
F_MRH S_FRH 5.5541 0.0000 0.0000
F_BUT S_FRH 5.4888 0.0000 0.0000
S_MIL F_MKH 5.3953 0.0000 0.0000
F_MRH F_FRH 5.3218 0.0000 0.0000
F_BUT F_FRH 5.2537 0.0000 0.0000
W_USR F_NIM 5.0250 0.0000 0.0000
F_BUT F_COL 5.0042 0.0000 0.0000
L_USR F_FRH 4.9626 0.0000 0.0000
S_MIL F_NIM 4.8518 0.0000 0.0000
W_USR F_MKH 4.7828 0.0000 0.0000
S_BUT L_USR 4.7529 0.0000 0.0000
F_MRH F_COL 4.7432 0.0000 0.0000
L_USR S_FRH 4.6219 0.0000 0.0000
S_MIL F_DER 4.4091 0.0000 0.0000
F_MER S_FRH 4.3662 0.0000 0.0000
F_BUT F_MKH 4.3205 0.0000 0.0000
S_MIL F_STN 4.2781 0.0000 0.0000
F_TOU F_MKH 4.2429 0.0000 0.0000
F_MRH F_MKH 4.2335 0.0000 0.0000
F_MIL S_FRH 4.2321 0.0000 0.0000
W_USR F_STN 4.1502 0.0000 0.0000
L_USR F_COL 4.1256 0.0000 0.0000
F_TOU F_NIM 4.0931 0.0000 0.0000
S_BUT F_TOU 3.9108 0.0001 0.0002
S_DER S_MIL 3.8990 0.0001 0.0002
S_BUT F_BUT 3.8714 0.0001 0.0002
F_MER F_FRH 3.8639 0.0001 0.0002
W_USR F_DER 3.6270 0.0003 0.0006
S_MIL F_MIL 3.6189 0.0003 0.0006
S_MIL F_MER 3.6182 0.0003 0.0006
F_NIM S_FRH 3.5938 0.0003 0.0006
F_BUT F_NIM 3.5459 0.0004 0.0008
S_BUT F_MRH 3.5442 0.0004 0.0008
F_MER F_COL 3.5146 0.0004 0.0008
F_MRH F_NIM 3.5124 0.0004 0.0008
F_STN S_FRH 3.4311 0.0006 0.0012
F_DER S_FRH 3.3807 0.0007 0.0014
F_TOU F_STN 3.3040 0.0010 0.0019
F_TOU F_DER 3.1047 0.0019 0.0036
F_MRH F_DER 3.0935 0.0020 0.0038
W_USR F_MIL 3.0501 0.0023 0.0043
F_MRH F_STN 2.9900 0.0028 0.0051
W_USR F_MER 2.9355 0.0033 0.0060
S_DER S_BUT 2.9150 0.0036 0.0064
F_MIL F_FRH 2.8198 0.0048 0.0085
L_USR F_NIM 2.7748 0.0055 0.0096
L_USR F_MKH 2.7127 0.0067 0.0115
S_MIL L_USR 2.6934 0.0071 0.0121
F_BUT F_STN 2.6779 0.0074 0.0124
F_MIL F_COL 2.6583 0.0079 0.0131
F_MKH S_FRH 2.6466 0.0081 0.0133
S_BUT W_USR 2.5876 0.0097 0.0157
F_NIM F_FRH 2.5694 0.0102 0.0163
F_STN F_FRH 2.5558 0.0106 0.0168
F_MER F_MKH 2.5383 0.0111 0.0173
F_BUT F_DER 2.5360 0.0112 0.0173
F_DER F_FRH 2.4110 0.0159 0.0243
F_TOU F_MER 2.2663 0.0234 0.0354
F_MRH F_MIL 2.2467 0.0247 0.0369
L_USR F_DER 2.1934 0.0283 0.0414
F_BUT F_MIL 2.1930 0.0283 0.0414
F_COL S_FRH 2.1666 0.0303 0.0438
L_USR F_STN 2.1160 0.0343 0.0491

Pairwise comparisons of Fis that are not significantlydifferent for individuals grouped by tributaries within runs.

pop1 pop2 stat p.value p_adj
F_BUT F_MER 2.0850 0.0371 0.0526
F_MRH F_MER 2.0742 0.0381 0.0534
F_TOU F_MIL 2.0110 0.0443 0.0615
S_MIL F_TOU 1.9924 0.0463 0.0636
W_USR L_USR 1.9798 0.0477 0.0649
F_DER F_COL 1.9702 0.0488 0.0657
F_MKH F_FRH 1.9402 0.0524 0.0699
F_FRH S_FRH 1.7935 0.0729 0.0963
F_STN F_COL 1.7595 0.0785 0.1027
F_NIM F_COL 1.7229 0.0849 0.1100
F_MIL F_MKH 1.7087 0.0875 0.1123
S_MIL F_BUT 1.6997 0.0892 0.1134
S_BUT S_MIL 1.6109 0.1072 0.1350
F_MER F_NIM 1.6052 0.1084 0.1353
W_USR F_BUT 1.4768 0.1397 0.1727
F_MRH L_USR 1.3439 0.1790 0.2193
S_MIL F_MRH 1.2944 0.1955 0.2374
F_MKH F_COL 1.2105 0.2261 0.2721
F_MIL F_NIM 1.1771 0.2392 0.2854
F_MER F_DER 1.0953 0.2734 0.3233
F_TOU L_USR 1.0781 0.2810 0.3294
F_MRH F_BUT 0.9302 0.3523 0.4095
L_USR F_MIL 0.9056 0.3652 0.4209
F_MER F_STN 0.8594 0.3901 0.4458
F_DER F_MKH 0.8017 0.4227 0.4791
W_USR F_TOU 0.7884 0.4305 0.4835
F_NIM F_MKH 0.7824 0.4340 0.4835
S_MIL W_USR 0.7767 0.4373 0.4835
F_STN F_MKH 0.7604 0.4470 0.4888
L_USR F_MER 0.7566 0.4493 0.4888
F_STN F_NIM 0.7182 0.4726 0.5101
W_USR F_MRH 0.7049 0.4809 0.5150
F_MIL F_STN 0.6580 0.5106 0.5425
F_BUT L_USR 0.5643 0.5725 0.6036
F_MER F_MIL 0.5042 0.6141 0.6424
F_DER F_NIM 0.4850 0.6277 0.6517
F_MIL F_DER 0.3097 0.7568 0.7797
F_COL F_FRH 0.2366 0.8130 0.8313
F_MRH F_TOU 0.1503 0.8805 0.8936
F_TOU F_BUT 0.1394 0.8891 0.8957
F_DER F_STN 0.0867 0.9309 0.9309

Table S5: Mean, standard deviation, median, 25th, and 75thquartile of Fis per locus for individuals grouped by run type &tributary.

GRP mean sd median Q1 Q3
F_COL -0.0171 0.1823 -0.0404 -0.1373 0.1002
F_MIL -0.0010 0.2328 -0.0370 -0.1515 0.1387
F_DER -0.0020 0.2470 -0.0400 -0.1667 0.1340
F_BUT 0.0077 0.2217 -0.0286 -0.1333 0.1461
F_FRH -0.0194 0.1889 -0.0417 -0.1429 0.0979
F_NIM -0.0100 0.1847 -0.0370 -0.1277 0.1018
F_MKH -0.0131 0.1835 -0.0400 -0.1304 0.0983
F_STN -0.0060 0.2100 -0.0471 -0.1429 0.1220
F_TOU 0.0064 0.1941 -0.0216 -0.1212 0.1253
F_MRH 0.0149 0.2630 -0.0400 -0.1507 0.1724
F_MER -0.0038 0.1820 -0.0300 -0.1200 0.1064
L_USR 0.0029 0.2144 -0.0270 -0.1429 0.1411
W_USR 0.0104 0.1989 -0.0204 -0.0870 0.0964
S_MIL 0.0205 0.2587 -0.0345 -0.1526 0.1875
S_DER 0.0370 0.2085 0.0000 -0.1064 0.1732
S_BUT 0.0273 0.2273 -0.0149 -0.1250 0.1628
S_FRH -0.0141 0.3251 -0.0909 -0.2000 0.1429

3.5 Comparison of allelic richness

The number of alleles observed per locus is not independent from sample size, especially for small samples sizes; with an increasing number of individuals sampled an increasing number of alleles is expected to be identified as rare alleles are more likely to be observed only for larger sample sizes. Differences in sample size can be adjusted for using rarefaction, i.e. by determining the lowest sample size and then subsampling x individuals from each group and counting alleles and using the mean number as the allele count.

Compare differences in rarefied allele counts among tributaries within runs

Calculate allele count per locus corrected for differences in sample size using rarefaction.

# by tributary within run
setPop(gen) <- ~RUN_LOC

# calculate allele counts using rarefaction
dat <- genind2hierfstat(gen)

df <- allelic.richness(dat,
                       diploid = TRUE)

df <- as.data.frame(df$Ar) %>%
  rownames_to_column("LOCUS")

colnames(df) <- c("LOCUS", trib_runs)

write_delim(df, "results/trib_run_rarefied.allelecount", delim = "\t")

Test of significant differences among runs using Friedman’s test.

## 
##  Friedman rank sum test
## 
## data:  AR and pop and LOCUS
## Friedman chi-squared = 4236.7, df = 16, p-value < 0.00000000000000022

Test for significant pairwise differences between tributaries within runs using Wilcoxon signed rank test.

Significance of pairwise comparison of rarefied allele counts among runs using Wilcoxon signed rank test. Tile color indicates p-value, label is stat (indicates direction of difference); p-values were adjusted for multiple comparisons using Benjamini-Hochberg.

Significance of pairwise comparison of rarefied allele counts among runs using Wilcoxon signed rank test. Tile color indicates p-value, label is stat (indicates direction of difference); p-values were adjusted for multiple comparisons using Benjamini-Hochberg.

Compare distributions of values.

Distribution of rarefied allele counts per locus for individuals grouped by run type within tributaries.

Distribution of rarefied allele counts per locus for individuals grouped by run type within tributaries.

Significance of pairwise comparisons of allele counts per locusfor individuals grouped by run type within tributaries.

pop1 pop2 stat p.value p_adj
F_MIL L_USR 47.3156 0.0000 0.0000
W_USR L_USR 46.7328 0.0000 0.0000
F_MRH L_USR 46.4299 0.0000 0.0000
F_FRH L_USR 44.2771 0.0000 0.0000
F_STN L_USR 43.5133 0.0000 0.0000
F_MKH L_USR 43.0950 0.0000 0.0000
F_TOU L_USR 42.7212 0.0000 0.0000
F_COL L_USR 42.6343 0.0000 0.0000
F_MER L_USR 42.3246 0.0000 0.0000
F_NIM L_USR 41.3909 0.0000 0.0000
S_FRH L_USR 41.3626 0.0000 0.0000
F_BUT L_USR 41.2526 0.0000 0.0000
S_MIL L_USR 40.7391 0.0000 0.0000
F_DER L_USR 40.5246 0.0000 0.0000
S_BUT L_USR 35.6730 0.0000 0.0000
S_DER L_USR 34.6154 0.0000 0.0000
F_MIL S_DER 14.2654 0.0000 0.0000
F_FRH S_DER 13.1600 0.0000 0.0000
F_MRH S_DER 12.9707 0.0000 0.0000
S_BUT S_DER 11.9349 0.0000 0.0000
W_USR S_DER 11.6031 0.0000 0.0000
F_TOU S_DER 10.5659 0.0000 0.0000
S_MIL S_DER 10.5214 0.0000 0.0000
F_BUT S_DER 10.3866 0.0000 0.0000
F_DER S_DER 10.3305 0.0000 0.0000
F_MIL F_STN 10.2502 0.0000 0.0000
F_STN S_DER 10.2064 0.0000 0.0000
F_FRH S_FRH 9.6038 0.0000 0.0000
F_MIL F_TOU 9.4725 0.0000 0.0000
F_COL S_DER 9.3000 0.0000 0.0000
F_MIL F_COL 9.2255 0.0000 0.0000
F_MKH S_DER 9.1481 0.0000 0.0000
F_FRH F_MER 8.6395 0.0000 0.0000
F_MIL S_FRH 8.4771 0.0000 0.0000
F_FRH F_NIM 8.4586 0.0000 0.0000
F_MIL F_MKH 8.1921 0.0000 0.0000
S_BUT S_FRH 8.0404 0.0000 0.0000
F_NIM S_DER 7.3800 0.0000 0.0000
F_MER S_DER 7.1594 0.0000 0.0000
F_FRH F_MKH 7.1041 0.0000 0.0000
F_MIL F_NIM 6.9014 0.0000 0.0000
S_BUT F_NIM 6.8211 0.0000 0.0000
F_FRH F_STN 6.8165 0.0000 0.0000
F_MRH S_FRH 6.7512 0.0000 0.0000
S_BUT F_MER 6.5489 0.0000 0.0000
S_BUT S_MIL 6.5485 0.0000 0.0000
F_FRH F_COL 6.5399 0.0000 0.0000
S_BUT F_DER 6.4543 0.0000 0.0000
F_FRH F_TOU 6.4394 0.0000 0.0000
F_BUT S_FRH 6.3824 0.0000 0.0000
S_BUT F_BUT 6.0932 0.0000 0.0000
F_MIL F_MER 6.0609 0.0000 0.0000
F_MIL F_MRH 6.0357 0.0000 0.0000
F_MIL S_MIL 5.6104 0.0000 0.0000
F_DER S_FRH 5.3952 0.0000 0.0000
F_DER F_NIM 5.3912 0.0000 0.0000
F_BUT F_NIM 5.2145 0.0000 0.0000
S_FRH S_DER 5.0933 0.0000 0.0000
F_DER F_COL 4.7337 0.0000 0.0000
S_BUT F_COL 4.6233 0.0000 0.0000
F_DER F_STN 4.5255 0.0000 0.0000
S_BUT F_MKH 4.4261 0.0000 0.0000
W_USR F_MKH 4.4182 0.0000 0.0000
W_USR F_COL 4.3170 0.0000 0.0000
F_DER F_MKH 4.1916 0.0000 0.0000
F_MRH F_NIM 4.1878 0.0000 0.0000
F_DER F_MER 4.1412 0.0000 0.0000
S_BUT F_STN 4.0940 0.0000 0.0000
F_NIM S_FRH 4.0116 0.0001 0.0002
W_USR F_TOU 3.8267 0.0001 0.0002
W_USR F_STN 3.8196 0.0001 0.0002
F_MIL F_BUT 3.7908 0.0002 0.0004
F_BUT F_MER 3.7600 0.0002 0.0004
F_DER F_TOU 3.6623 0.0002 0.0004
W_USR S_FRH 3.6614 0.0003 0.0005
F_MRH F_MER 3.6446 0.0003 0.0005
F_MRH F_STN 3.5797 0.0003 0.0005
F_BUT F_STN 3.5178 0.0004 0.0007
S_BUT F_TOU 3.4181 0.0006 0.0010
F_FRH F_MRH 3.3930 0.0007 0.0012
F_BUT F_MKH 3.3042 0.0010 0.0017
F_BUT F_TOU 3.1003 0.0019 0.0032
S_MIL S_FRH 3.0946 0.0020 0.0033
F_DER S_MIL 3.0470 0.0023 0.0037
F_BUT F_COL 2.9286 0.0034 0.0054
F_TOU S_FRH 2.9152 0.0036 0.0057
F_COL F_STN 2.8340 0.0046 0.0072
F_COL F_TOU 2.8103 0.0049 0.0076
S_BUT F_FRH 2.7886 0.0053 0.0081
F_STN S_FRH 2.7431 0.0061 0.0092
F_DER F_BUT 2.7099 0.0067 0.0100
W_USR F_NIM 2.7077 0.0068 0.0101
F_FRH W_USR 2.6747 0.0075 0.0110
F_COL S_FRH 2.4104 0.0159 0.0230
F_NIM F_MER 2.3342 0.0196 0.0281
S_MIL F_NIM 2.2807 0.0226 0.0320
F_FRH S_MIL 2.2315 0.0256 0.0359
F_MKH S_FRH 2.1175 0.0342 0.0475
S_MIL F_MER 2.0975 0.0360 0.0495

Table S6: Mean, standard deviation, median, 25th, and 75thquartile of allelic richness across loci for individuals grouped by runtype & tributary.

pop mean sd median Q1 Q3
F_COL 1.51 0.48 1.45 1 1.91
F_MIL 1.52 0.48 1.47 1 1.92
F_DER 1.51 0.50 1.42 1 1.93
F_BUT 1.51 0.49 1.38 1 1.93
F_FRH 1.52 0.49 1.48 1 1.93
F_NIM 1.51 0.49 1.44 1 1.91
F_MKH 1.51 0.48 1.44 1 1.91
F_STN 1.51 0.48 1.43 1 1.90
F_TOU 1.51 0.48 1.43 1 1.91
F_MRH 1.52 0.48 1.43 1 1.91
F_MER 1.51 0.48 1.42 1 1.91
L_USR 1.36 0.45 1.19 1 1.80
W_USR 1.52 0.48 1.47 1 1.91
S_MIL 1.51 0.50 1.38 1 1.93
S_DER 1.48 0.50 1.46 1 1.92
S_BUT 1.51 0.55 1.71 1 1.99
S_FRH 1.50 0.49 1.42 1 1.91

3.6 Evenness of allelic diversity

The Evenness of allelic diversity at a given locus is calculated as the ratio of the number of abundant genotypes to the number of rarer genotypes calculated using the ratio of Stoddart & Tayolor index (diversity index weighted for more abundant alleles) & Shannon-Wiener index (diversity index weighted for more rare alleles).

Compare differences in evenness of allelic diversity among tributaries within runs

Test of significant differences among runs using Friedman’s test.

## 
##  Friedman rank sum test
## 
## data:  EVENNESS and GRP and LOCUS
## Friedman chi-squared = 791.32, df = 16, p-value < 0.00000000000000022

Test for significant pairwise differences between runs using Wilcoxon signed rank test.

Significance of pairwise comparisons of evenness of allelic diversity per locus for individuals grouped by run type within tributaries using using Wilcoxon signed rank test. Tile color indicates stat (indicates direction of difference); p-values were adjusted for multiple comparisons using Benjamini-Hochberg.

Significance of pairwise comparisons of evenness of allelic diversity per locus for individuals grouped by run type within tributaries using using Wilcoxon signed rank test. Tile color indicates stat (indicates direction of difference); p-values were adjusted for multiple comparisons using Benjamini-Hochberg.

Significance of pairwise comparisons of evenness of allelicdiversity across loci for individuals grouped by run type withintributaries.

pop1 pop2 stat p.value p_adj
F_FRH W_USR 21.10 0.0000 0.0000
S_FRH W_USR 20.80 0.0000 0.0000
F_MIL W_USR 20.36 0.0000 0.0000
S_MIL W_USR 20.31 0.0000 0.0000
F_COL W_USR 20.18 0.0000 0.0000
S_DER W_USR 19.95 0.0000 0.0000
F_DER W_USR 19.24 0.0000 0.0000
F_MER W_USR 19.24 0.0000 0.0000
F_MRH W_USR 19.22 0.0000 0.0000
F_NIM W_USR 19.12 0.0000 0.0000
F_STN W_USR 19.02 0.0000 0.0000
F_MKH W_USR 18.97 0.0000 0.0000
F_BUT W_USR 18.78 0.0000 0.0000
F_TOU W_USR 18.78 0.0000 0.0000
L_USR W_USR 18.59 0.0000 0.0000
S_BUT W_USR 17.83 0.0000 0.0000
F_FRH F_TOU 4.65 0.0000 0.0000
S_FRH S_BUT 4.61 0.0000 0.0000
F_FRH S_BUT 4.55 0.0000 0.0000
F_FRH L_USR 4.34 0.0000 0.0000
S_FRH F_TOU 4.24 0.0000 0.0000
F_FRH F_BUT 4.20 0.0000 0.0000
S_FRH L_USR 4.19 0.0000 0.0000
F_FRH F_MKH 4.04 0.0001 0.0005
F_FRH F_NIM 3.99 0.0001 0.0005
F_FRH F_MER 3.91 0.0001 0.0005
S_FRH F_NIM 3.89 0.0001 0.0005
S_FRH F_BUT 3.89 0.0001 0.0005
S_FRH F_MKH 3.82 0.0001 0.0005
S_FRH F_MER 3.75 0.0002 0.0009
S_MIL S_BUT 3.63 0.0003 0.0013
F_MIL S_BUT 3.55 0.0004 0.0016
F_MIL F_BUT 3.53 0.0004 0.0016
F_FRH F_STN 3.50 0.0005 0.0019
F_MIL F_TOU 3.47 0.0005 0.0019
S_FRH F_STN 3.41 0.0007 0.0026
S_FRH F_DER 3.40 0.0007 0.0026
S_FRH S_DER 3.34 0.0008 0.0029
F_FRH F_DER 3.34 0.0009 0.0031
S_FRH F_COL 3.31 0.0009 0.0031
F_FRH F_COL 3.30 0.0010 0.0033
F_MIL L_USR 3.12 0.0018 0.0058
F_MIL F_NIM 3.10 0.0020 0.0063
F_FRH S_DER 2.96 0.0031 0.0096
F_MIL F_MKH 2.76 0.0058 0.0175
F_COL S_BUT 2.72 0.0065 0.0192
F_MIL F_DER 2.70 0.0068 0.0197
F_MRH S_BUT 2.62 0.0088 0.0249
F_MIL F_MER 2.55 0.0107 0.0297
S_FRH F_MRH 2.52 0.0116 0.0316
S_DER S_BUT 2.51 0.0121 0.0323
F_FRH F_MRH 2.39 0.0171 0.0447
S_MIL L_USR 2.34 0.0191 0.0490

Compare distributions of values.

Distribution of evenness of allelic diversity per locus for individuals grouped by run type within tributaries.

Distribution of evenness of allelic diversity per locus for individuals grouped by run type within tributaries.

Table S7: Mean, standard deviation, median, 25th, and 75thquartile of evenness of evenness per locus for individuals grouped byrun type & tributary.

GRP mean sd median Q1 Q3
F_COL 0.7645 0.1556 0.7720 0.6506 0.8990
F_MIL 0.7681 0.1568 0.7747 0.6461 0.8990
F_DER 0.7631 0.1612 0.7720 0.6257 0.8990
F_BUT 0.7617 0.1602 0.7747 0.6336 0.8990
F_FRH 0.7697 0.1540 0.7823 0.6478 0.8990
F_NIM 0.7631 0.1582 0.7720 0.6321 0.8990
F_MKH 0.7632 0.1580 0.7703 0.6397 0.8990
F_STN 0.7631 0.1596 0.7786 0.6336 0.8990
F_TOU 0.7620 0.1585 0.7703 0.6353 0.8988
F_MRH 0.7648 0.1614 0.7720 0.6397 0.8990
F_MER 0.7634 0.1581 0.7720 0.6321 0.8984
L_USR 0.7609 0.1613 0.7769 0.6397 0.8990
W_USR 0.6925 0.1988 0.6853 0.5040 0.8694
S_MIL 0.7663 0.1615 0.7734 0.6397 0.8990
S_DER 0.7633 0.1600 0.7684 0.6397 0.8990
S_BUT 0.7578 0.1695 0.7769 0.6223 0.9085
S_FRH 0.7715 0.1577 0.7534 0.6397 0.9240

3.7 Nucleotide diversity \(\pi\) (pairwise differences)

\(\theta_{T}\) (Nei 1987), the nucleotide diversity (\(\pi\)) is calculated as the sum of the number of differences between pairs of haplotypes of a given locus, divided by the number of comparisons made. This parameter is biased towards alleles segregating at intermediate rates and will underestimate genetic diversity due to rare alleles.

Create a tidy data set of variant call information (position, alternate alleles) from filtered VCF file used to create haplotypes. Create a data frame of haplotype sequences in the data set(s) using reference contig sequences, VCF information, and codes.out file from rad_haplotyper.

# Read in the reference sequence
fa <- seqinr::read.fasta("data/REF/CHRIST2018/reference.fasta")

# create tibble with locus name and sequence
ref <- tibble::tibble(name = seqinr::getName(fa), seq = toupper(unlist(seqinr::getSequence(fa, as.string = TRUE))))

# Read the haplotype VCF file and put it in tidy format
vcf_tidy <- read.vcfR("data/POPGEN/ONC.haps.filtered.recode.vcf") %>%
    vcfR2tidy()

# Get the data frame with the positions
pos_tbl <- vcf_tidy$fix %>%
  filter(ALT %in% c("A", "T", "C", "G")) %>%
  unite(LOCUS, CHROM, POS, sep = "_", remove = FALSE)


# Make a data frame with all of the haplotype sequences from the 'codes.out' file from rad_haplotyper
hap_index <- read_tsv("data/HAPLOTYPING/codes.ONC.gen", col_names = FALSE) %>%
    filter(X1 %in% locNames(gen_obj)) %>%
    split(.$X1) %>%
    purrr::map(function(x) {
      codes <- str_split(x$X2, ",") %>%
        unlist()

      code_tbl <- tibble(locus = x$X1, code = codes) %>%
        separate(code, c("hap", "code"), sep = ":") %>%
        left_join(ref, by = c("locus" = "name"))

      pos <- pos_tbl %>%
        filter(CHROM == code_tbl$locus[1]) %>%
        pull(POS)

      for (i in 1:nrow(code_tbl)) {

        alleles <- str_split(code_tbl$hap[i], "") %>%
          unlist()

        replace <- tibble(pos = pos, allele = alleles)

        for (j in 1:nrow(replace)) {
          str_sub(code_tbl$seq[i], replace$pos[j], 1) <- replace$allele[j]
        }

      }

      code_tbl

    }) %>%
    bind_rows()

Calculate nucleotide diversity for individuals grouped by run type within each tributary.

# set pops
setPop(gen) <- ~RUN_LOC

# create tidy data set
gen_tidy <- gen %>%
  genind2df(oneColPerAll = TRUE) %>%
  rownames_to_column(var = "ind") %>%
  gather(allele, code, -pop, -ind) %>%
  separate(allele, c("locus", "allele"), sep = "\\.") %>%
  filter(pop %in% trib_runs) %>%
  droplevels()

# Calculate haplotype related stats
hap_div_tbl <- locNames(gen) %>%
  purrr::map(function(y) {

    # for each locus
    gen_tidy %>%
      filter(locus == y) %>%
      filter(!code == "NA") %>%
      left_join(hap_index, by = c("code", "locus")) %>%
      as.tibble() %>%
      arrange(factor(pop, levels = trib_runs)) %>%

      # for each population per locus
      split(.$pop) %>%
      purrr::map(function(x) {
        y <- do.call(rbind, strsplit(x$seq, ""))
        rownames(y) <- paste(x$ind, x$allele, sep = ".")

        # create DNAbin object for locus
        dna_bin <- as.DNAbin(y)                            

        # calculate Tajima's D and p-values
        taj_test <- tajima.test(dna_bin)

        # create table with results
        tibble(locus = x$locus[1],
               pop = x$pop[1],
               nuc_div = nuc.div(dna_bin)
               
      }) %>%
      bind_rows()
  }) %>%
    bind_rows()

write_delim(hap_div_tbl, "results/trib_run.hapdiv", delim = "\t")

Test of significant differences among runs using Friedman’s test. To get an unreplicated complete block design only loci variable in all locations were used for test of heterogeneity among groups.

## 
##  Friedman rank sum test
## 
## data:  nuc_div and pop and locus
## Friedman chi-squared = 3466.1, df = 16, p-value < 0.00000000000000022

Test for significance of pairwise comparisons using Wilcoxon signed rank test.

Pairwise comparison of levels of significant differences among individuals grouped by tributary and run type.

Pairwise comparison of levels of significant differences among individuals grouped by tributary and run type.

Compare distribution of values.

Distribution of nucleotide diversity across all loci for individuals grouped by tributary and runs

Distribution of nucleotide diversity across all loci for individuals grouped by tributary and runs

Significance of pairwise comparisons of nucleotide diversityamong runs within tributaries.

pop1 pop2 stat p.value p_adj
F_FRH W_USR 39.1630 0.0000 0.0000
S_DER W_USR 38.7344 0.0000 0.0000
F_COL W_USR 38.0562 0.0000 0.0000
F_MIL W_USR 36.2643 0.0000 0.0000
F_MER W_USR 35.8979 0.0000 0.0000
F_NIM W_USR 35.4991 0.0000 0.0000
F_TOU W_USR 35.3306 0.0000 0.0000
F_MKH W_USR 34.9710 0.0000 0.0000
F_STN W_USR 34.8903 0.0000 0.0000
L_USR W_USR 34.3190 0.0000 0.0000
F_BUT W_USR 33.7188 0.0000 0.0000
S_MIL W_USR 33.4761 0.0000 0.0000
F_DER W_USR 33.3037 0.0000 0.0000
F_MRH W_USR 32.6041 0.0000 0.0000
S_BUT W_USR 29.3361 0.0000 0.0000
S_FRH W_USR 25.7698 0.0000 0.0000
F_FRH S_BUT 10.6605 0.0000 0.0000
F_MIL S_BUT 9.5402 0.0000 0.0000
F_COL S_BUT 9.3455 0.0000 0.0000
S_DER S_BUT 9.2374 0.0000 0.0000
F_FRH F_MER 8.4649 0.0000 0.0000
F_FRH F_MKH 8.0066 0.0000 0.0000
F_FRH F_NIM 7.8470 0.0000 0.0000
F_MIL F_BUT 7.8298 0.0000 0.0000
F_FRH F_BUT 7.8248 0.0000 0.0000
S_MIL S_BUT 7.5732 0.0000 0.0000
F_FRH L_USR 7.5485 0.0000 0.0000
F_FRH F_TOU 7.3375 0.0000 0.0000
F_NIM S_BUT 7.2899 0.0000 0.0000
F_MIL L_USR 7.1077 0.0000 0.0000
F_MER S_BUT 7.0675 0.0000 0.0000
F_MIL F_STN 6.9027 0.0000 0.0000
F_MKH S_BUT 6.7397 0.0000 0.0000
F_FRH F_DER 6.4714 0.0000 0.0000
F_MRH S_BUT 6.4262 0.0000 0.0000
F_TOU S_BUT 6.4188 0.0000 0.0000
F_FRH S_FRH 6.3862 0.0000 0.0000
F_DER S_BUT 6.1653 0.0000 0.0000
F_FRH F_STN 6.1338 0.0000 0.0000
F_COL L_USR 5.6465 0.0000 0.0000
F_FRH S_MIL 5.6304 0.0000 0.0000
F_FRH F_COL 5.0746 0.0000 0.0000
F_MIL F_TOU 4.9916 0.0000 0.0000
F_COL F_BUT 4.9583 0.0000 0.0000
F_FRH F_MRH 4.4841 0.0000 0.0000
F_MIL F_MKH 4.4521 0.0000 0.0000
F_COL S_FRH 4.4422 0.0000 0.0000
S_DER S_FRH 4.3559 0.0000 0.0000
F_MIL S_FRH 4.3219 0.0000 0.0000
F_MIL F_MER 4.2738 0.0000 0.0000
F_STN S_BUT 4.2074 0.0000 0.0000
F_BUT S_BUT 4.1849 0.0000 0.0000
S_DER F_TOU 3.9544 0.0001 0.0003
F_MIL F_NIM 3.8064 0.0001 0.0003
F_COL F_STN 3.4889 0.0005 0.0012
F_MRH F_BUT 3.4750 0.0005 0.0012
S_DER F_BUT 3.3535 0.0008 0.0019
S_DER F_MKH 3.2866 0.0010 0.0023
F_DER F_BUT 3.2501 0.0012 0.0027
F_DER L_USR 3.2285 0.0012 0.0027
L_USR S_BUT 3.1631 0.0016 0.0035
S_DER F_MER 3.1512 0.0016 0.0035
F_NIM F_BUT 2.8420 0.0045 0.0097
S_DER F_NIM 2.8292 0.0047 0.0100
S_DER L_USR 2.7652 0.0057 0.0119
F_NIM S_FRH 2.7592 0.0058 0.0120
F_MIL F_DER 2.7509 0.0059 0.0120
F_NIM L_USR 2.7437 0.0061 0.0122
F_MKH L_USR 2.7356 0.0062 0.0122
F_MRH F_DER 2.6655 0.0077 0.0150
F_MER F_BUT 2.6482 0.0081 0.0155
F_MKH F_MER 2.5498 0.0108 0.0203
F_COL F_MER 2.5450 0.0109 0.0203
S_DER F_DER 2.4792 0.0132 0.0243
F_MRH L_USR 2.4721 0.0134 0.0243
F_MKH F_NIM 2.4247 0.0153 0.0274
F_MER L_USR 2.3802 0.0173 0.0306
S_DER S_MIL 2.3737 0.0176 0.0307
F_TOU S_FRH 2.3173 0.0205 0.0353
S_FRH S_BUT 2.3070 0.0211 0.0359
S_DER F_STN 2.3011 0.0214 0.0359
F_COL S_MIL 2.2854 0.0223 0.0370
F_MKH F_BUT 2.2459 0.0247 0.0403
F_MER S_FRH 2.2424 0.0249 0.0403
S_MIL L_USR 2.2186 0.0265 0.0424
S_DER F_MRH 2.1733 0.0298 0.0471
F_MKH S_FRH 2.1338 0.0329 0.0514
F_BUT L_USR 2.1021 0.0355 0.0549
S_MIL F_BUT 2.0595 0.0394 0.0602
F_TOU F_BUT 1.9216 0.0547 0.0827
F_TOU L_USR 1.8511 0.0642 0.0959
F_MIL F_MRH 1.7711 0.0765 0.1131
F_MIL S_MIL 1.7501 0.0801 0.1160
S_MIL F_STN 1.7494 0.0802 0.1160
F_MRH S_MIL 1.7340 0.0829 0.1187
F_NIM F_STN 1.6828 0.0924 0.1308
F_MRH F_STN 1.6749 0.0940 0.1308
F_FRH S_DER 1.6699 0.0949 0.1308
F_STN L_USR 1.6688 0.0952 0.1308
F_MIL F_COL 1.6428 0.1004 0.1365
F_COL F_TOU 1.5766 0.1149 0.1547
F_MIL S_DER 1.5062 0.1320 0.1760
F_NIM F_MER 1.4076 0.1593 0.2103
F_MRH S_FRH 1.3832 0.1666 0.2179
F_MER F_STN 1.3571 0.1748 0.2264
F_MRH F_MER 1.3084 0.1907 0.2434
F_COL F_MKH 1.3062 0.1915 0.2434
F_MKH F_DER 1.2650 0.2059 0.2593
F_TOU F_MER 1.2137 0.2249 0.2806
F_COL F_DER 1.1711 0.2415 0.2986
F_MRH F_TOU 1.1316 0.2578 0.3159
F_FRH F_MIL 1.1019 0.2705 0.3285
F_MKH S_MIL 1.0733 0.2832 0.3408
F_COL F_NIM 1.0207 0.3074 0.3667
F_BUT F_STN 1.0124 0.3113 0.3681
F_NIM S_MIL 0.9788 0.3277 0.3842
F_MRH F_MKH 0.9442 0.3451 0.4011
F_DER F_STN 0.8775 0.3802 0.4382
F_TOU F_NIM 0.8558 0.3921 0.4481
F_COL F_MRH 0.8145 0.4154 0.4708
F_BUT S_FRH 0.7313 0.4646 0.5222
F_NIM F_DER 0.7121 0.4764 0.5311
F_STN S_FRH 0.6743 0.5001 0.5530
S_MIL S_FRH 0.5758 0.5648 0.6158
F_MRH F_NIM 0.5739 0.5660 0.6158
S_DER F_COL 0.5644 0.5725 0.6179
F_MKH F_TOU 0.4981 0.6184 0.6622
F_TOU S_MIL 0.4698 0.6385 0.6784
S_FRH L_USR 0.4294 0.6676 0.7038
F_TOU F_DER 0.3588 0.7197 0.7529
F_MER S_MIL 0.3470 0.7286 0.7564
F_DER S_MIL 0.3213 0.7479 0.7706
F_MKH F_STN 0.3003 0.7639 0.7811
F_STN F_TOU 0.2000 0.8415 0.8541
F_DER S_FRH 0.1108 0.9118 0.9186
F_MER F_DER 0.0505 0.9597 0.9597

Table S8: Comparison of mean +/- standard deviation, minimum,and maximum values of the nucleotide diversity (pi) across loci amongindividuals grouped by run within tributary.

pop mean std max min
F_COL 0.001045 0.001275 0.014935 0
F_MIL 0.001056 0.001298 0.015935 0
F_DER 0.001038 0.001294 0.014056 0
F_BUT 0.001026 0.001292 0.014613 0
F_FRH 0.001058 0.001272 0.013546 0
F_NIM 0.001034 0.001271 0.014977 0
F_MKH 0.001035 0.001276 0.015955 0
F_STN 0.001028 0.001281 0.014589 0
F_TOU 0.001033 0.001275 0.014795 0
F_MRH 0.001042 0.001311 0.015478 0
F_MER 0.001035 0.001274 0.014661 0
L_USR 0.001029 0.001289 0.014423 0
W_USR 0.000790 0.001212 0.013255 0
S_MIL 0.001043 0.001315 0.016022 0
S_DER 0.001046 0.001275 0.015410 0
S_BUT 0.001012 0.001308 0.015957 0
S_FRH 0.001048 0.001392 0.014501 0

3.8 Tajima’s D

\(\theta\) is the population-scaled mutation rate and can be estimated as \(\hat{\theta}_T\) (Nei 1987), the number of pairwise differences (nucleotide diversity, \(\pi\)), while \(\hat{\theta}_W\) is measured as the number of segregating sites (Watterson 1975).

Because \(\hat{\theta}_T\) will underestimate the number of mutations that are rare in the population, Tajima’s \(D\) can be used to test the neutral mutation hypothesis:

3.8.1 Pairwise comparison of Tajima’s D across populations

Test of significant differences among runs using Friedman’s test. To get an unreplicated complete block design only loci variable (Tajima’s D is able to be calculated) in all locations were used for test of heterogeneity among groups.

## 
##  Friedman rank sum test
## 
## data:  tajima_d and pop and locus
## Friedman chi-squared = 3033.8, df = 16, p-value < 0.00000000000000022

Test for significance of pairwise comparisons using Wilcoxon signed rank test.

Pairwise comparison of levels of significant differences among individuals grouped by tributary and run type.

Pairwise comparison of levels of significant differences among individuals grouped by tributary and run type.

Compare distribution of values.

Distribution of nucleotide diversity across all loci for individuals grouped by tributary and runs

Distribution of nucleotide diversity across all loci for individuals grouped by tributary and runs

Significance of pairwise comparisons of nucleotide diversityamong runs within tributaries.

pop1 pop2 stat p.value p_adj
F_FRH S_FRH 26.5849 0.0000 0.0000
F_MER S_FRH 26.2976 0.0000 0.0000
F_COL S_FRH 25.5437 0.0000 0.0000
F_NIM S_FRH 24.8900 0.0000 0.0000
F_MKH S_FRH 24.0444 0.0000 0.0000
F_TOU S_FRH 23.8379 0.0000 0.0000
S_DER S_FRH 21.3976 0.0000 0.0000
F_STN S_FRH 19.9711 0.0000 0.0000
F_COL W_USR 19.7533 0.0000 0.0000
F_MER W_USR 19.2801 0.0000 0.0000
F_MER F_DER 19.2761 0.0000 0.0000
F_MER F_MRH 19.2191 0.0000 0.0000
F_FRH W_USR 19.1950 0.0000 0.0000
F_NIM W_USR 18.7630 0.0000 0.0000
L_USR S_FRH 18.6359 0.0000 0.0000
F_COL F_DER 18.4471 0.0000 0.0000
F_FRH F_MRH 18.4293 0.0000 0.0000
F_COL F_MRH 18.3611 0.0000 0.0000
F_NIM F_MRH 18.1141 0.0000 0.0000
S_DER W_USR 17.9094 0.0000 0.0000
F_NIM F_DER 17.6786 0.0000 0.0000
F_FRH F_DER 17.6316 0.0000 0.0000
F_BUT S_FRH 17.6218 0.0000 0.0000
F_MKH W_USR 17.5967 0.0000 0.0000
F_TOU W_USR 17.5878 0.0000 0.0000
F_MIL S_FRH 17.2354 0.0000 0.0000
F_MKH F_MRH 16.5180 0.0000 0.0000
F_MKH F_DER 16.3529 0.0000 0.0000
F_TOU F_MRH 16.0814 0.0000 0.0000
F_TOU F_DER 15.2753 0.0000 0.0000
F_STN W_USR 14.9442 0.0000 0.0000
S_BUT S_FRH 14.7833 0.0000 0.0000
L_USR W_USR 14.6604 0.0000 0.0000
S_MIL S_FRH 14.0956 0.0000 0.0000
F_MER F_BUT 13.6490 0.0000 0.0000
F_BUT W_USR 13.4927 0.0000 0.0000
F_COL F_MIL 13.4734 0.0000 0.0000
F_DER S_FRH 13.3879 0.0000 0.0000
F_MIL W_USR 13.2877 0.0000 0.0000
F_MER F_MIL 13.0190 0.0000 0.0000
S_BUT W_USR 12.9179 0.0000 0.0000
F_COL S_MIL 12.6701 0.0000 0.0000
F_NIM F_BUT 12.5090 0.0000 0.0000
F_MRH S_FRH 12.4327 0.0000 0.0000
F_FRH F_MIL 12.3683 0.0000 0.0000
F_FRH S_MIL 12.3372 0.0000 0.0000
F_COL F_BUT 12.3272 0.0000 0.0000
F_MER S_MIL 12.2343 0.0000 0.0000
F_NIM F_MIL 12.1236 0.0000 0.0000
F_FRH F_BUT 12.0649 0.0000 0.0000
S_MIL W_USR 11.7525 0.0000 0.0000
F_NIM S_MIL 11.6273 0.0000 0.0000
S_DER F_MRH 11.5480 0.0000 0.0000
S_DER S_MIL 11.3109 0.0000 0.0000
S_DER F_DER 11.2090 0.0000 0.0000
F_STN F_MRH 10.7660 0.0000 0.0000
F_DER W_USR 10.3608 0.0000 0.0000
F_MER F_STN 10.3304 0.0000 0.0000
F_MER L_USR 10.2101 0.0000 0.0000
F_MKH F_BUT 10.1922 0.0000 0.0000
F_MKH F_MIL 9.9919 0.0000 0.0000
F_COL S_BUT 9.9542 0.0000 0.0000
F_NIM F_STN 9.8380 0.0000 0.0000
F_COL L_USR 9.8129 0.0000 0.0000
F_STN F_DER 9.6685 0.0000 0.0000
F_MKH S_MIL 9.6327 0.0000 0.0000
F_MER S_BUT 9.5989 0.0000 0.0000
F_MRH W_USR 9.5451 0.0000 0.0000
F_TOU F_MIL 9.4676 0.0000 0.0000
F_COL F_STN 9.4089 0.0000 0.0000
F_FRH S_BUT 9.4079 0.0000 0.0000
F_TOU F_BUT 9.3481 0.0000 0.0000
F_TOU S_MIL 9.3395 0.0000 0.0000
L_USR F_MRH 9.1500 0.0000 0.0000
F_FRH L_USR 9.0957 0.0000 0.0000
F_FRH F_STN 8.8313 0.0000 0.0000
F_NIM S_BUT 8.7719 0.0000 0.0000
F_NIM L_USR 8.7611 0.0000 0.0000
L_USR F_DER 8.1895 0.0000 0.0000
S_DER S_BUT 7.8476 0.0000 0.0000
F_BUT F_MRH 7.3951 0.0000 0.0000
F_TOU S_BUT 7.3056 0.0000 0.0000
F_MKH S_BUT 7.0906 0.0000 0.0000
F_MKH F_STN 6.8108 0.0000 0.0000
F_BUT F_DER 6.7771 0.0000 0.0000
S_DER F_MIL 6.7527 0.0000 0.0000
F_TOU L_USR 6.3255 0.0000 0.0000
F_MIL F_MRH 6.3219 0.0000 0.0000
F_MKH L_USR 6.2571 0.0000 0.0000
S_DER F_BUT 6.1590 0.0000 0.0000
F_TOU F_STN 5.8991 0.0000 0.0000
F_MIL F_DER 5.8486 0.0000 0.0000
F_STN S_MIL 5.2547 0.0000 0.0000
S_BUT F_MRH 4.5124 0.0000 0.0000
L_USR S_MIL 4.4513 0.0000 0.0000
F_MER F_TOU 4.4447 0.0000 0.0000
F_MER F_MKH 4.2973 0.0000 0.0000
S_DER L_USR 4.2409 0.0000 0.0000
S_BUT F_DER 3.8866 0.0001 0.0001
F_COL S_DER 3.8819 0.0001 0.0001
F_STN F_MIL 3.8198 0.0001 0.0001
F_COL F_MKH 3.8111 0.0001 0.0001
F_COL F_TOU 3.7087 0.0002 0.0003
S_DER F_STN 3.6931 0.0002 0.0003
F_MER S_DER 3.6497 0.0003 0.0004
F_FRH S_DER 3.3763 0.0007 0.0009
F_STN F_BUT 3.2806 0.0010 0.0013
F_NIM F_TOU 3.2762 0.0011 0.0014
F_STN S_BUT 3.2630 0.0011 0.0014
F_FRH F_TOU 3.2079 0.0013 0.0016
L_USR F_MIL 3.1097 0.0019 0.0023
F_NIM S_DER 3.0337 0.0024 0.0029
F_BUT S_MIL 3.0331 0.0024 0.0029
F_FRH F_MKH 2.9880 0.0028 0.0033
F_NIM F_MKH 2.8695 0.0041 0.0048
S_MIL F_MRH 2.7808 0.0054 0.0063
L_USR S_BUT 2.4143 0.0158 0.0184
F_MIL S_MIL 2.3319 0.0197 0.0227
S_BUT S_MIL 1.8825 0.0598 0.0683
L_USR F_BUT 1.8600 0.0629 0.0713
S_MIL F_DER 1.6147 0.1064 0.1196
F_STN L_USR 1.0432 0.2968 0.3309
F_MER F_NIM 0.9721 0.3310 0.3660
F_COL F_NIM 0.9512 0.3415 0.3745
F_MKH S_DER 0.8753 0.3814 0.4150
F_DER F_MRH 0.8548 0.3927 0.4239
F_BUT S_BUT 0.7211 0.4708 0.5042
F_COL F_FRH 0.7129 0.4759 0.5056
F_TOU S_DER 0.6973 0.4856 0.5120
F_MER F_FRH 0.5523 0.5808 0.6076
S_FRH W_USR 0.3931 0.6943 0.7208
F_MKH F_TOU 0.3647 0.7153 0.7370
F_MIL S_BUT 0.3156 0.7523 0.7693
F_FRH F_NIM 0.2919 0.7703 0.7818
F_MER F_COL 0.2378 0.8121 0.8181
F_BUT F_MIL 0.1100 0.9124 0.9124

Table S9: Comparison of mean +/- standard deviation, minimum,and maximum values of the Tajima’s D across loci among individualsgrouped by run within tributary.

pop mean std max min
W_USR 0.094266 1.035023 2.669786 -1.858645
S_BUT 0.049145 0.954543 2.709811 -2.180529
L_USR -0.046518 0.956674 2.592781 -1.876505
S_MIL -0.050683 0.963661 2.447507 -1.889476
F_MKH -0.058910 0.953114 2.587745 -1.763630
F_NIM -0.059634 0.951602 2.662807 -1.847405
S_DER -0.062072 0.945730 2.648186 -1.866038
F_MER -0.063557 0.953941 2.712666 -1.975765
F_TOU -0.068298 0.951630 2.704980 -1.854902
F_COL -0.071360 0.942641 2.673614 -1.847405
F_FRH -0.076019 0.946088 2.648186 -1.764299
F_STN -0.080884 0.962657 2.556259 -2.001031
S_FRH -0.087391 0.977333 2.145134 -1.797595
F_MRH -0.088288 0.959500 2.483303 -2.005646
F_MIL -0.093226 0.953617 2.386602 -1.900752
F_BUT -0.095481 0.956493 2.495917 -1.879671
F_DER -0.107183 0.962557 2.418479 -1.733626

3.8.2 Comparison to simulated data set assuming mutation-drift equilibrium

Without detailed demographic information, it can be difficult to distinguish between an effect of selection, population expansion/decline and drift compared to demographics. Calculating a conventional p-value is problematic because you cannot describe the distribution of Tajimas \(D\) independent of true \(\theta\) value which is unknown.

Originally, Tajima proposed comparing the distribution to a beta distribution, however this has been shown to be too conservative. Instead, to identify whether the observed genome-wide distribution of Tajima’s \(D\) reflects the expected distribution under the neutral mutation hypothesis, we will generate a genome-wide null distribution of Tajima’s \(D\) for a set of neutral loci reflecting the composition of the haplotyped empirical data set consisting of the same number of loci with same distribution of segregating sites by simulating loci under coalescence using MS (Hudson 2002) to compare to the empirical data set.

3.8.2.1 F_COL

The maximum number of segregating sites per locus in the data set is 7, there are 30) F_COL run individuals in the data set.

Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 30 individuals (60 observations).


# 2*number of individuals in data set
N=60

# number of loci to simulate
REP=1000

# simulate neutral loci
for i in {1..7}
do

  scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind60_Sites${i}_neut_Rep1000.stats

done

The sample_stat script distributed along side MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.

Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.

# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind60")

# create an empty list that will serve as a container to receive the incoming files
sim <-list()

# create a loop to read in your data
for (i in 1:length(filenames)){

  sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
                         col_names = c("t1", "pi",
                                       "t2", "seg_sites",
                                       "t3", "tajima_d",
                                       "t4", "thetaH",
                                       "t5", "H")) %>%
    select(pi, seg_sites, tajima_d, thetaH, H)

}

# add names
names(sim) <- filenames

# combine into dataframe
sim <- ldply(sim, data.frame) %>%
  rename(simulation = `.id`) %>%
  separate(simulation, into = c("temp", "t1"), sep = -6) %>%
  separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
  mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
         N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
  select(-t1, -t2)

reps <- 100

# group to be compared
grp <- "F_COL"

# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites) %>%
  filter(!seg_sites == 0) %>%
  filter(pop == grp) %>%
  count(seg_sites)

# create null distributions
null_distributions <- list()

for (s in emp$seg_sites) {

      # determine number of loci with s segregating  sites
      t <- emp %>%
        filter(seg_sites == s)

      n_loci <- t$n

     # sample n_loci simulated loci (with replacement)
     null_distributions[[s]] <- sim %>%
        filter(seg_sites == s) %>%
        sample_n(size = n_loci, replace = TRUE)

}

null_distributions <- ldply(null_distributions, data.frame) %>%
    mutate(pop = grp)

taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites, tajima_d) %>%
  filter(!seg_sites == 0,
         pop == grp) %>%
  bind_rows(null_distributions) %>%
  mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
  select(-N_IND, -N_REPS, -pi, -thetaH, -H)

taj_all <- list()

taj_all[[grp]] <- taj

Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima's D under equilibrium assuming a beta distribution (the median should intersect at 0).
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.15772, p-value < 0.00000000000000022
## alternative hypothesis: two-sided

3.8.2.2 F_MIL

The maximum number of segregating sites per locus in the data set is 7, there are 20) F_MIL run individuals in the data set.

Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 20 individuals (40 observations).


# 2*number of individuals in data set
N=40

# number of loci to simulate
REP=1000

# simulate neutral loci
for i in {1..7}
do

  scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind40_Sites${i}_neut_Rep1000.stats

done

The sample_stat script distributed along side MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.

Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.

# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind40")

# create an empty list that will serve as a container to receive the incoming files
sim <-list()

# create a loop to read in your data
for (i in 1:length(filenames)){

  sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
                         col_names = c("t1", "pi",
                                       "t2", "seg_sites",
                                       "t3", "tajima_d",
                                       "t4", "thetaH",
                                       "t5", "H")) %>%
    select(pi, seg_sites, tajima_d, thetaH, H)

}

# add names
names(sim) <- filenames

# combine into dataframe
sim <- ldply(sim, data.frame) %>%
  rename(simulation = `.id`) %>%
  separate(simulation, into = c("temp", "t1"), sep = -6) %>%
  separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
  mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
         N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
  select(-t1, -t2)

reps <- 100

# group to be compared
grp <- "F_MIL"

# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites) %>%
  filter(!seg_sites == 0) %>%
  filter(pop == grp) %>%
  count(seg_sites)

# create null distributions
null_distributions <- list()

for (s in emp$seg_sites) {

      # determine number of loci with s segregating  sites
      t <- emp %>%
        filter(seg_sites == s)

      n_loci <- t$n

     # sample n_loci simulated loci (with replacement)
     null_distributions[[s]] <- sim %>%
        filter(seg_sites == s) %>%
        sample_n(size = n_loci, replace = TRUE)

}

null_distributions <- ldply(null_distributions, data.frame) %>%
    mutate(pop = grp)

taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites, tajima_d) %>%
  filter(!seg_sites == 0,
         pop == grp) %>%
  bind_rows(null_distributions) %>%
  mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
  select(-N_IND, -N_REPS, -pi, -thetaH, -H)

taj_all[[grp]] <- taj

Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima's D under equilibrium assuming a beta distribution (the median should intersect at 0).
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.21342, p-value < 0.00000000000000022
## alternative hypothesis: two-sided

3.8.2.3 F_DER

The maximum number of segregating sites per locus in the data set is 7, there are 15) F_DER run individuals in the data set.

Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 15 individuals (30 observations).


# 2*number of individuals in data set
N=30

# number of loci to simulate
REP=1000

# simulate neutral loci
for i in {1..7}
do

  scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind30_Sites${i}_neut_Rep1000.stats

done

The sample_stat script distributed along side MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.

Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.

# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind30")

# create an empty list that will serve as a container to receive the incoming files
sim <-list()

# create a loop to read in your data
for (i in 1:length(filenames)){

  sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
                         col_names = c("t1", "pi",
                                       "t2", "seg_sites",
                                       "t3", "tajima_d",
                                       "t4", "thetaH",
                                       "t5", "H")) %>%
    select(pi, seg_sites, tajima_d, thetaH, H)

}

# add names
names(sim) <- filenames

# combine into dataframe
sim <- ldply(sim, data.frame) %>%
  rename(simulation = `.id`) %>%
  separate(simulation, into = c("temp", "t1"), sep = -6) %>%
  separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
  mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
         N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
  select(-t1, -t2)

reps <- 100

# group to be compared
grp <- "F_DER"

# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites) %>%
  filter(!seg_sites == 0) %>%
  filter(pop == grp) %>%
  count(seg_sites)

# create null distributions
null_distributions <- list()

for (s in emp$seg_sites) {

      # determine number of loci with s segregating  sites
      t <- emp %>%
        filter(seg_sites == s)

      n_loci <- t$n

     # sample n_loci simulated loci (with replacement)
     null_distributions[[s]] <- sim %>%
        filter(seg_sites == s) %>%
        sample_n(size = n_loci, replace = TRUE)

}

null_distributions <- ldply(null_distributions, data.frame) %>%
    mutate(pop = grp)

taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites, tajima_d) %>%
  filter(!seg_sites == 0,
         pop == grp) %>%
  bind_rows(null_distributions) %>%
  mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
  select(-N_IND, -N_REPS, -pi, -thetaH, -H)

taj_all[[grp]] <- taj

Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima's D under equilibrium assuming a beta distribution (the median should intersect at 0).
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.12052, p-value < 0.00000000000000022
## alternative hypothesis: two-sided

3.8.2.4 F_BUT

The maximum number of segregating sites per locus in the data set is 7, there are 21) F_BUT run individuals in the data set.

Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 21 individuals (42 observations).


# 2*number of individuals in data set
N=42

# number of loci to simulate
REP=1000

# simulate neutral loci
for i in {1..7}
do

  scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind42_Sites${i}_neut_Rep1000.stats

done

The sample_stat script distributed along side MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.

Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.

# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind60")

# create an empty list that will serve as a container to receive the incoming files
sim <-list()

# create a loop to read in your data
for (i in 1:length(filenames)){

  sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
                         col_names = c("t1", "pi",
                                       "t2", "seg_sites",
                                       "t3", "tajima_d",
                                       "t4", "thetaH",
                                       "t5", "H")) %>%
    select(pi, seg_sites, tajima_d, thetaH, H)

}

# add names
names(sim) <- filenames

# combine into dataframe
sim <- ldply(sim, data.frame) %>%
  rename(simulation = `.id`) %>%
  separate(simulation, into = c("temp", "t1"), sep = -6) %>%
  separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
  mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
         N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
  select(-t1, -t2)

reps <- 100

# group to be compared
grp <- "F_BUT"

# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites) %>%
  filter(!seg_sites == 0) %>%
  filter(pop == grp) %>%
  count(seg_sites)

# create null distributions
null_distributions <- list()

for (s in emp$seg_sites) {

      # determine number of loci with s segregating  sites
      t <- emp %>%
        filter(seg_sites == s)

      n_loci <- t$n

     # sample n_loci simulated loci (with replacement)
     null_distributions[[s]] <- sim %>%
        filter(seg_sites == s) %>%
        sample_n(size = n_loci, replace = TRUE)

}

null_distributions <- ldply(null_distributions, data.frame) %>%
    mutate(pop = grp)

taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites, tajima_d) %>%
  filter(!seg_sites == 0,
         pop == grp) %>%
  bind_rows(null_distributions) %>%
  mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
  select(-N_IND, -N_REPS, -pi, -thetaH, -H)

taj_all[[grp]] <- taj

Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima's D under equilibrium assuming a beta distribution (the median should intersect at 0).
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.20954, p-value < 0.00000000000000022
## alternative hypothesis: two-sided

3.8.2.5 F_FRH

The maximum number of segregating sites per locus in the data set is 7, there are 27) F_FRH run individuals in the data set.

Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 27 individuals (54 observations).


# 2*number of individuals in data set
N=54

# number of loci to simulate
REP=1000

# simulate neutral loci
for i in {1..7}
do

  scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind54_Sites${i}_neut_Rep1000.stats

done

The sample_stat script distributed along side MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.

Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.

# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind54")

# create an empty list that will serve as a container to receive the incoming files
sim <-list()

# create a loop to read in your data
for (i in 1:length(filenames)){

  sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
                         col_names = c("t1", "pi",
                                       "t2", "seg_sites",
                                       "t3", "tajima_d",
                                       "t4", "thetaH",
                                       "t5", "H")) %>%
    select(pi, seg_sites, tajima_d, thetaH, H)

}

# add names
names(sim) <- filenames

# combine into dataframe
sim <- ldply(sim, data.frame) %>%
  rename(simulation = `.id`) %>%
  separate(simulation, into = c("temp", "t1"), sep = -6) %>%
  separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
  mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
         N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
  select(-t1, -t2)

reps <- 100

# group to be compared
grp <- "F_FRH"

# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites) %>%
  filter(!seg_sites == 0) %>%
  filter(pop == grp) %>%
  count(seg_sites)

# create null distributions
null_distributions <- list()

for (s in emp$seg_sites) {

      # determine number of loci with s segregating  sites
      t <- emp %>%
        filter(seg_sites == s)

      n_loci <- t$n

     # sample n_loci simulated loci (with replacement)
     null_distributions[[s]] <- sim %>%
        filter(seg_sites == s) %>%
        sample_n(size = n_loci, replace = TRUE)

}

null_distributions <- ldply(null_distributions, data.frame) %>%
    mutate(pop = grp)

taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites, tajima_d) %>%
  filter(!seg_sites == 0,
         pop == grp) %>%
  bind_rows(null_distributions) %>%
  mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
  select(-N_IND, -N_REPS, -pi, -thetaH, -H)

taj_all[[grp]] <- taj

Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima's D under equilibrium assuming a beta distribution (the median should intersect at 0).
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.10915, p-value < 0.00000000000000022
## alternative hypothesis: two-sided

3.8.2.6 F_NIM

The maximum number of segregating sites per locus in the data set is 7, there are 30) F_NIM run individuals in the data set.

Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 30 individuals (60 observations).


# 2*number of individuals in data set
N=60

# number of loci to simulate
REP=1000

# simulate neutral loci
for i in {1..7}
do

  scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind60_Sites${i}_neut_Rep1000.stats

done

The sample_stat script distributed along side MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.

Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.

# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind60")

# create an empty list that will serve as a container to receive the incoming files
sim <-list()

# create a loop to read in your data
for (i in 1:length(filenames)){

  sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
                         col_names = c("t1", "pi",
                                       "t2", "seg_sites",
                                       "t3", "tajima_d",
                                       "t4", "thetaH",
                                       "t5", "H")) %>%
    select(pi, seg_sites, tajima_d, thetaH, H)

}

# add names
names(sim) <- filenames

# combine into dataframe
sim <- ldply(sim, data.frame) %>%
  rename(simulation = `.id`) %>%
  separate(simulation, into = c("temp", "t1"), sep = -6) %>%
  separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
  mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
         N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
  select(-t1, -t2)

reps <- 100

# group to be compared
grp <- "F_NIM"

# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites) %>%
  filter(!seg_sites == 0) %>%
  filter(pop == grp) %>%
  count(seg_sites)

# create null distributions
null_distributions <- list()

for (s in emp$seg_sites) {

      # determine number of loci with s segregating  sites
      t <- emp %>%
        filter(seg_sites == s)

      n_loci <- t$n

     # sample n_loci simulated loci (with replacement)
     null_distributions[[s]] <- sim %>%
        filter(seg_sites == s) %>%
        sample_n(size = n_loci, replace = TRUE)

}

null_distributions <- ldply(null_distributions, data.frame) %>%
    mutate(pop = grp)

taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites, tajima_d) %>%
  filter(!seg_sites == 0,
         pop == grp) %>%
  bind_rows(null_distributions) %>%
  mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
  select(-N_IND, -N_REPS, -pi, -thetaH, -H)

taj_all[[grp]] <- taj

Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima's D under equilibrium assuming a beta distribution (the median should intersect at 0).
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.16563, p-value < 0.00000000000000022
## alternative hypothesis: two-sided

3.8.2.7 F_MKH

The maximum number of segregating sites per locus in the data set is 7, there are 28) F_MKH run individuals in the data set.

Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 28 individuals (56 observations).


# 2*number of individuals in data set
N=56

# number of loci to simulate
REP=1000

# simulate neutral loci
for i in {1..7}
do

  scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind56_Sites${i}_neut_Rep1000.stats

done

The sample_stat script distributed along side MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.

Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.

# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind56")

# create an empty list that will serve as a container to receive the incoming files
sim <-list()

# create a loop to read in your data
for (i in 1:length(filenames)){

  sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
                         col_names = c("t1", "pi",
                                       "t2", "seg_sites",
                                       "t3", "tajima_d",
                                       "t4", "thetaH",
                                       "t5", "H")) %>%
    select(pi, seg_sites, tajima_d, thetaH, H)

}

# add names
names(sim) <- filenames

# combine into dataframe
sim <- ldply(sim, data.frame) %>%
  rename(simulation = `.id`) %>%
  separate(simulation, into = c("temp", "t1"), sep = -6) %>%
  separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
  mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
         N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
  select(-t1, -t2)

reps <- 100

# group to be compared
grp <- "F_MKH"

# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites) %>%
  filter(!seg_sites == 0) %>%
  filter(pop == grp) %>%
  count(seg_sites)

# create null distributions
null_distributions <- list()

for (s in emp$seg_sites) {

      # determine number of loci with s segregating  sites
      t <- emp %>%
        filter(seg_sites == s)

      n_loci <- t$n

     # sample n_loci simulated loci (with replacement)
     null_distributions[[s]] <- sim %>%
        filter(seg_sites == s) %>%
        sample_n(size = n_loci, replace = TRUE)

}

null_distributions <- ldply(null_distributions, data.frame) %>%
    mutate(pop = grp)

taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites, tajima_d) %>%
  filter(!seg_sites == 0,
         pop == grp) %>%
  bind_rows(null_distributions) %>%
  mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
  select(-N_IND, -N_REPS, -pi, -thetaH, -H)

taj_all[[grp]] <- taj

Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima's D under equilibrium assuming a beta distribution (the median should intersect at 0).
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.16758, p-value < 0.00000000000000022
## alternative hypothesis: two-sided

3.8.2.8 F_STN

The maximum number of segregating sites per locus in the data set is 7, there are 23) F_STN run individuals in the data set.

Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 23 individuals (46 observations).


# 2*number of individuals in data set
N=46

# number of loci to simulate
REP=1000

# simulate neutral loci
for i in {1..7}
do

  scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind46_Sites${i}_neut_Rep1000.stats

done

The sample_stat script distributed along side MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.

Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.

# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind46")

# create an empty list that will serve as a container to receive the incoming files
sim <-list()

# create a loop to read in your data
for (i in 1:length(filenames)){

  sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
                         col_names = c("t1", "pi",
                                       "t2", "seg_sites",
                                       "t3", "tajima_d",
                                       "t4", "thetaH",
                                       "t5", "H")) %>%
    select(pi, seg_sites, tajima_d, thetaH, H)

}

# add names
names(sim) <- filenames

# combine into dataframe
sim <- ldply(sim, data.frame) %>%
  rename(simulation = `.id`) %>%
  separate(simulation, into = c("temp", "t1"), sep = -6) %>%
  separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
  mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
         N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
  select(-t1, -t2)

reps <- 100

# group to be compared
grp <- "F_STN"

# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites) %>%
  filter(!seg_sites == 0) %>%
  filter(pop == grp) %>%
  count(seg_sites)

# create null distributions
null_distributions <- list()

for (s in emp$seg_sites) {

      # determine number of loci with s segregating  sites
      t <- emp %>%
        filter(seg_sites == s)

      n_loci <- t$n

     # sample n_loci simulated loci (with replacement)
     null_distributions[[s]] <- sim %>%
        filter(seg_sites == s) %>%
        sample_n(size = n_loci, replace = TRUE)

}

null_distributions <- ldply(null_distributions, data.frame) %>%
    mutate(pop = grp)

taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites, tajima_d) %>%
  filter(!seg_sites == 0,
         pop == grp) %>%
  bind_rows(null_distributions) %>%
  mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
  select(-N_IND, -N_REPS, -pi, -thetaH, -H)

taj_all[[grp]] <- taj

Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima's D under equilibrium assuming a beta distribution (the median should intersect at 0).
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.19666, p-value < 0.00000000000000022
## alternative hypothesis: two-sided

3.8.2.9 F_TOU

The maximum number of segregating sites per locus in the data set is 7, there are 30) F_TOU run individuals in the data set.

Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 30 individuals (60 observations).


# 2*number of individuals in data set
N=60

# number of loci to simulate
REP=1000

# simulate neutral loci
for i in {1..7}
do

  scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind60_Sites${i}_neut_Rep1000.stats

done

The sample_stat script distributed along side MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.

Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.

# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind60")

# create an empty list that will serve as a container to receive the incoming files
sim <-list()

# create a loop to read in your data
for (i in 1:length(filenames)){

  sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
                         col_names = c("t1", "pi",
                                       "t2", "seg_sites",
                                       "t3", "tajima_d",
                                       "t4", "thetaH",
                                       "t5", "H")) %>%
    select(pi, seg_sites, tajima_d, thetaH, H)

}

# add names
names(sim) <- filenames

# combine into dataframe
sim <- ldply(sim, data.frame) %>%
  rename(simulation = `.id`) %>%
  separate(simulation, into = c("temp", "t1"), sep = -6) %>%
  separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
  mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
         N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
  select(-t1, -t2)

reps <- 100

# group to be compared
grp <- "F_TOU"

# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites) %>%
  filter(!seg_sites == 0) %>%
  filter(pop == grp) %>%
  count(seg_sites)

# create null distributions
null_distributions <- list()

for (s in emp$seg_sites) {

      # determine number of loci with s segregating  sites
      t <- emp %>%
        filter(seg_sites == s)

      n_loci <- t$n

     # sample n_loci simulated loci (with replacement)
     null_distributions[[s]] <- sim %>%
        filter(seg_sites == s) %>%
        sample_n(size = n_loci, replace = TRUE)

}

null_distributions <- ldply(null_distributions, data.frame) %>%
    mutate(pop = grp)

taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites, tajima_d) %>%
  filter(!seg_sites == 0,
         pop == grp) %>%
  bind_rows(null_distributions) %>%
  mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
  select(-N_IND, -N_REPS, -pi, -thetaH, -H)

taj_all[[grp]] <- taj

Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima's D under equilibrium assuming a beta distribution (the median should intersect at 0).
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.1764, p-value < 0.00000000000000022
## alternative hypothesis: two-sided

3.8.2.10 F_MRH

The maximum number of segregating sites per locus in the data set is 7, there are 15) F_MRH run individuals in the data set.

Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 15 individuals (30 observations).


# 2*number of individuals in data set
N=30

# number of loci to simulate
REP=1000

# simulate neutral loci
for i in {1..7}
do

  scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind30_Sites${i}_neut_Rep1000.stats

done

The sample_stat script distributed along side MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.

Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.

# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind30")

# create an empty list that will serve as a container to receive the incoming files
sim <-list()

# create a loop to read in your data
for (i in 1:length(filenames)){

  sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
                         col_names = c("t1", "pi",
                                       "t2", "seg_sites",
                                       "t3", "tajima_d",
                                       "t4", "thetaH",
                                       "t5", "H")) %>%
    select(pi, seg_sites, tajima_d, thetaH, H)

}

# add names
names(sim) <- filenames

# combine into dataframe
sim <- ldply(sim, data.frame) %>%
  rename(simulation = `.id`) %>%
  separate(simulation, into = c("temp", "t1"), sep = -6) %>%
  separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
  mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
         N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
  select(-t1, -t2)

reps <- 100

# group to be compared
grp <- "F_MRH"

# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites) %>%
  filter(!seg_sites == 0) %>%
  filter(pop == grp) %>%
  count(seg_sites)

# create null distributions
null_distributions <- list()

for (s in emp$seg_sites) {

      # determine number of loci with s segregating  sites
      t <- emp %>%
        filter(seg_sites == s)

      n_loci <- t$n

     # sample n_loci simulated loci (with replacement)
     null_distributions[[s]] <- sim %>%
        filter(seg_sites == s) %>%
        sample_n(size = n_loci, replace = TRUE)

}

null_distributions <- ldply(null_distributions, data.frame) %>%
    mutate(pop = grp)

taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites, tajima_d) %>%
  filter(!seg_sites == 0,
         pop == grp) %>%
  bind_rows(null_distributions) %>%
  mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
  select(-N_IND, -N_REPS, -pi, -thetaH, -H)

taj_all[[grp]] <- taj

Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima's D under equilibrium assuming a beta distribution (the median should intersect at 0).
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.19097, p-value < 0.00000000000000022
## alternative hypothesis: two-sided

3.8.2.11 F_MER

The maximum number of segregating sites per locus in the data set is 7, there are 31) F_MER run individuals in the data set.

Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 31 individuals (62 observations).


# 2*number of individuals in data set
N=62

# number of loci to simulate
REP=1000

# simulate neutral loci
for i in {1..7}
do

  scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind62_Sites${i}_neut_Rep1000.stats

done

The sample_stat script distributed along side MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.

Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.

# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind62")

# create an empty list that will serve as a container to receive the incoming files
sim <-list()

# create a loop to read in your data
for (i in 1:length(filenames)){

  sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
                         col_names = c("t1", "pi",
                                       "t2", "seg_sites",
                                       "t3", "tajima_d",
                                       "t4", "thetaH",
                                       "t5", "H")) %>%
    select(pi, seg_sites, tajima_d, thetaH, H)

}

# add names
names(sim) <- filenames

# combine into dataframe
sim <- ldply(sim, data.frame) %>%
  rename(simulation = `.id`) %>%
  separate(simulation, into = c("temp", "t1"), sep = -6) %>%
  separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
  mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
         N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
  select(-t1, -t2)

reps <- 100

# group to be compared
grp <- "F_MER"

# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites) %>%
  filter(!seg_sites == 0) %>%
  filter(pop == grp) %>%
  count(seg_sites)

# create null distributions
null_distributions <- list()

for (s in emp$seg_sites) {

      # determine number of loci with s segregating  sites
      t <- emp %>%
        filter(seg_sites == s)

      n_loci <- t$n

     # sample n_loci simulated loci (with replacement)
     null_distributions[[s]] <- sim %>%
        filter(seg_sites == s) %>%
        sample_n(size = n_loci, replace = TRUE)

}

null_distributions <- ldply(null_distributions, data.frame) %>%
    mutate(pop = grp)

taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites, tajima_d) %>%
  filter(!seg_sites == 0,
         pop == grp) %>%
  bind_rows(null_distributions) %>%
  mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
  select(-N_IND, -N_REPS, -pi, -thetaH, -H)

taj_all[[grp]] <- taj

Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima's D under equilibrium assuming a beta distribution (the median should intersect at 0).
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.16428, p-value < 0.00000000000000022
## alternative hypothesis: two-sided

3.8.2.12 L_USR

The maximum number of segregating sites per locus in the data set is 7, there are 21) L_USR run individuals in the data set.

Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 21 individuals (42 observations).


# 2*number of individuals in data set
N=42

# number of loci to simulate
REP=1000

# simulate neutral loci
for i in {1..7}
do

  scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind42_Sites${i}_neut_Rep1000.stats

done

The sample_stat script distributed along side MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.

Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.

# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind42")

# create an empty list that will serve as a container to receive the incoming files
sim <-list()

# create a loop to read in your data
for (i in 1:length(filenames)){

  sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
                         col_names = c("t1", "pi",
                                       "t2", "seg_sites",
                                       "t3", "tajima_d",
                                       "t4", "thetaH",
                                       "t5", "H")) %>%
    select(pi, seg_sites, tajima_d, thetaH, H)

}

# add names
names(sim) <- filenames

# combine into dataframe
sim <- ldply(sim, data.frame) %>%
  rename(simulation = `.id`) %>%
  separate(simulation, into = c("temp", "t1"), sep = -6) %>%
  separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
  mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
         N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
  select(-t1, -t2)

reps <- 100

# group to be compared
grp <- "L_USR"

# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites) %>%
  filter(!seg_sites == 0) %>%
  filter(pop == grp) %>%
  count(seg_sites)

# create null distributions
null_distributions <- list()

for (s in emp$seg_sites) {

      # determine number of loci with s segregating  sites
      t <- emp %>%
        filter(seg_sites == s)

      n_loci <- t$n

     # sample n_loci simulated loci (with replacement)
     null_distributions[[s]] <- sim %>%
        filter(seg_sites == s) %>%
        sample_n(size = n_loci, replace = TRUE)

}

null_distributions <- ldply(null_distributions, data.frame) %>%
    mutate(pop = grp)

taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites, tajima_d) %>%
  filter(!seg_sites == 0,
         pop == grp) %>%
  bind_rows(null_distributions) %>%
  mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
  select(-N_IND, -N_REPS, -pi, -thetaH, -H)

taj_all[[grp]] <- taj

Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima's D under equilibrium assuming a beta distribution (the median should intersect at 0).
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.12979, p-value < 0.00000000000000022
## alternative hypothesis: two-sided

3.8.2.13 W_USR

The maximum number of segregating sites per locus in the data set is 7, there are 26) W_USR run individuals in the data set.

Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 26 individuals (52 observations).


# 2*number of individuals in data set
N=52

# number of loci to simulate
REP=1000

# simulate neutral loci
for i in {1..7}
do

  scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind52_Sites${i}_neut_Rep1000.stats

done

The sample_stat script distributed along side MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.

Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.

# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind42")

# create an empty list that will serve as a container to receive the incoming files
sim <-list()

# create a loop to read in your data
for (i in 1:length(filenames)){

  sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
                         col_names = c("t1", "pi",
                                       "t2", "seg_sites",
                                       "t3", "tajima_d",
                                       "t4", "thetaH",
                                       "t5", "H")) %>%
    select(pi, seg_sites, tajima_d, thetaH, H)

}

# add names
names(sim) <- filenames

# combine into dataframe
sim <- ldply(sim, data.frame) %>%
  rename(simulation = `.id`) %>%
  separate(simulation, into = c("temp", "t1"), sep = -6) %>%
  separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
  mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
         N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
  select(-t1, -t2)

reps <- 100

# group to be compared
grp <- "W_USR"

# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites) %>%
  filter(!seg_sites == 0) %>%
  filter(pop == grp) %>%
  count(seg_sites)

# create null distributions
null_distributions <- list()

for (s in emp$seg_sites) {

      # determine number of loci with s segregating  sites
      t <- emp %>%
        filter(seg_sites == s)

      n_loci <- t$n

     # sample n_loci simulated loci (with replacement)
     null_distributions[[s]] <- sim %>%
        filter(seg_sites == s) %>%
        sample_n(size = n_loci, replace = TRUE)

}

null_distributions <- ldply(null_distributions, data.frame) %>%
    mutate(pop = grp)

taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites, tajima_d) %>%
  filter(!seg_sites == 0,
         pop == grp) %>%
  bind_rows(null_distributions) %>%
  mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
  select(-N_IND, -N_REPS, -pi, -thetaH, -H)

taj_all[[grp]] <- taj

Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima's D under equilibrium assuming a beta distribution (the median should intersect at 0).
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.21578, p-value < 0.00000000000000022
## alternative hypothesis: two-sided

3.8.2.14 S_MIL

The maximum number of segregating sites per locus in the data set is 7, there are 16) S_MIL run individuals in the data set.

Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 16 individuals (32 observations).


# 2*number of individuals in data set
N=32

# number of loci to simulate
REP=1000

# simulate neutral loci
for i in {1..7}
do

  scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind32_Sites${i}_neut_Rep1000.stats

done

The sample_stat script distributed along side MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.

Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.

# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind32")

# create an empty list that will serve as a container to receive the incoming files
sim <-list()

# create a loop to read in your data
for (i in 1:length(filenames)){

  sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
                         col_names = c("t1", "pi",
                                       "t2", "seg_sites",
                                       "t3", "tajima_d",
                                       "t4", "thetaH",
                                       "t5", "H")) %>%
    select(pi, seg_sites, tajima_d, thetaH, H)

}

# add names
names(sim) <- filenames

# combine into dataframe
sim <- ldply(sim, data.frame) %>%
  rename(simulation = `.id`) %>%
  separate(simulation, into = c("temp", "t1"), sep = -6) %>%
  separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
  mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
         N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
  select(-t1, -t2)

reps <- 100

# group to be compared
grp <- "S_MIL"

# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites) %>%
  filter(!seg_sites == 0) %>%
  filter(pop == grp) %>%
  count(seg_sites)

# create null distributions
null_distributions <- list()

for (s in emp$seg_sites) {

      # determine number of loci with s segregating  sites
      t <- emp %>%
        filter(seg_sites == s)

      n_loci <- t$n

     # sample n_loci simulated loci (with replacement)
     null_distributions[[s]] <- sim %>%
        filter(seg_sites == s) %>%
        sample_n(size = n_loci, replace = TRUE)

}

null_distributions <- ldply(null_distributions, data.frame) %>%
    mutate(pop = grp)

taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites, tajima_d) %>%
  filter(!seg_sites == 0,
         pop == grp) %>%
  bind_rows(null_distributions) %>%
  mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
  select(-N_IND, -N_REPS, -pi, -thetaH, -H)

taj_all[[grp]] <- taj

Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima's D under equilibrium assuming a beta distribution (the median should intersect at 0).
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.16933, p-value < 0.00000000000000022
## alternative hypothesis: two-sided

3.8.2.15 S_DER

The maximum number of segregating sites per locus in the data set is 7, there are 27) S_DER run individuals in the data set.

Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 27 individuals (54 observations).


# 2*number of individuals in data set
N=54

# number of loci to simulate
REP=1000

# simulate neutral loci
for i in {1..7}
do

  scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind54_Sites${i}_neut_Rep1000.stats

done

The sample_stat script distributed along side MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.

Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.

# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind54")

# create an empty list that will serve as a container to receive the incoming files
sim <-list()

# create a loop to read in your data
for (i in 1:length(filenames)){

  sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
                         col_names = c("t1", "pi",
                                       "t2", "seg_sites",
                                       "t3", "tajima_d",
                                       "t4", "thetaH",
                                       "t5", "H")) %>%
    select(pi, seg_sites, tajima_d, thetaH, H)

}

# add names
names(sim) <- filenames

# combine into dataframe
sim <- ldply(sim, data.frame) %>%
  rename(simulation = `.id`) %>%
  separate(simulation, into = c("temp", "t1"), sep = -6) %>%
  separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
  mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
         N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
  select(-t1, -t2)

reps <- 100

# group to be compared
grp <- "S_DER"

# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites) %>%
  filter(!seg_sites == 0) %>%
  filter(pop == grp) %>%
  count(seg_sites)

# create null distributions
null_distributions <- list()

for (s in emp$seg_sites) {

      # determine number of loci with s segregating  sites
      t <- emp %>%
        filter(seg_sites == s)

      n_loci <- t$n

     # sample n_loci simulated loci (with replacement)
     null_distributions[[s]] <- sim %>%
        filter(seg_sites == s) %>%
        sample_n(size = n_loci, replace = TRUE)

}

null_distributions <- ldply(null_distributions, data.frame) %>%
    mutate(pop = grp)

taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites, tajima_d) %>%
  filter(!seg_sites == 0,
         pop == grp) %>%
  bind_rows(null_distributions) %>%
  mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
  select(-N_IND, -N_REPS, -pi, -thetaH, -H)

taj_all[[grp]] <- taj

Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima's D under equilibrium assuming a beta distribution (the median should intersect at 0).
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.14984, p-value < 0.00000000000000022
## alternative hypothesis: two-sided

3.8.2.16 S_BUT

The maximum number of segregating sites per locus in the data set is 7, there are 19) S_BUT run individuals in the data set.

Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 19 individuals (38 observations).


# 2*number of individuals in data set
N=38

# number of loci to simulate
REP=1000

# simulate neutral loci
for i in {1..7}
do

  scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind38_Sites${i}_neut_Rep1000.stats

done

The sample_stat script distributed along side MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.

Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.

# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind38")

# create an empty list that will serve as a container to receive the incoming files
sim <-list()

# create a loop to read in your data
for (i in 1:length(filenames)){

  sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
                         col_names = c("t1", "pi",
                                       "t2", "seg_sites",
                                       "t3", "tajima_d",
                                       "t4", "thetaH",
                                       "t5", "H")) %>%
    select(pi, seg_sites, tajima_d, thetaH, H)

}

# add names
names(sim) <- filenames

# combine into dataframe
sim <- ldply(sim, data.frame) %>%
  rename(simulation = `.id`) %>%
  separate(simulation, into = c("temp", "t1"), sep = -6) %>%
  separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
  mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
         N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
  select(-t1, -t2)

reps <- 100

# group to be compared
grp <- "S_BUT"

# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites) %>%
  filter(!seg_sites == 0) %>%
  filter(pop == grp) %>%
  count(seg_sites)

# create null distributions
null_distributions <- list()

for (s in emp$seg_sites) {

      # determine number of loci with s segregating  sites
      t <- emp %>%
        filter(seg_sites == s)

      n_loci <- t$n

     # sample n_loci simulated loci (with replacement)
     null_distributions[[s]] <- sim %>%
        filter(seg_sites == s) %>%
        sample_n(size = n_loci, replace = TRUE)

}

null_distributions <- ldply(null_distributions, data.frame) %>%
    mutate(pop = grp)

taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites, tajima_d) %>%
  filter(!seg_sites == 0,
         pop == grp) %>%
  bind_rows(null_distributions) %>%
  mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
  select(-N_IND, -N_REPS, -pi, -thetaH, -H)

taj_all[[grp]] <- taj

Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima's D under equilibrium assuming a beta distribution (the median should intersect at 0).
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.15249, p-value < 0.00000000000000022
## alternative hypothesis: two-sided

3.8.2.17 S_FRH

The maximum number of segregating sites per locus in the data set is 7, there are 7) S_FRH run individuals in the data set.

Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 7 individuals (14 observations).


# 2*number of individuals in data set
N=14

# number of loci to simulate
REP=1000

# simulate neutral loci
for i in {1..7}
do

  scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind14_Sites${i}_neut_Rep1000.stats

done

The sample_stat script distributed along side MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.

Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.

# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind14")

# create an empty list that will serve as a container to receive the incoming files
sim <-list()

# create a loop to read in your data
for (i in 1:length(filenames)){

  sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
                         col_names = c("t1", "pi",
                                       "t2", "seg_sites",
                                       "t3", "tajima_d",
                                       "t4", "thetaH",
                                       "t5", "H")) %>%
    select(pi, seg_sites, tajima_d, thetaH, H)

}

# add names
names(sim) <- filenames

# combine into dataframe
sim <- ldply(sim, data.frame) %>%
  rename(simulation = `.id`) %>%
  separate(simulation, into = c("temp", "t1"), sep = -6) %>%
  separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
  mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
         N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
  select(-t1, -t2)

reps <- 100

# group to be compared
grp <- "S_FRH"

# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites) %>%
  filter(pop == grp) %>%
  filter(!seg_sites == 0) %>%
  count(seg_sites)

# create null distributions
null_distributions <- list()

for (s in emp$seg_sites) {

      # determine number of loci with s segregating  sites
      t <- emp %>%
        filter(seg_sites == s)

      n_loci <- t$n

     # sample n_loci simulated loci (with replacement)
     null_distributions[[s]] <- sim %>%
        filter(seg_sites == s) %>%
        sample_n(size = n_loci, replace = TRUE)

}

null_distributions <- ldply(null_distributions, data.frame) %>%
    mutate(pop = grp)

taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
  select(locus, pop, seg_sites, tajima_d) %>%
  filter(!seg_sites == 0,
         pop == grp) %>%
  bind_rows(null_distributions) %>%
  mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
  select(-N_IND, -N_REPS, -pi, -thetaH, -H)

taj_all[[grp]] <- taj

Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).

Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima's D under equilibrium assuming a beta distribution (the median should intersect at 0).
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.30666, p-value < 0.00000000000000022
## alternative hypothesis: two-sided

3.8.2.18 Comparison across all trib/run populations

Let’s compare the null and empirical distributions across all data sets.

Comparison of neutral and empirical distributions of Tajima’s D for each run in each tributary.

Comparison of neutral and empirical distributions of Tajima's D for each run in each tributary.

Table S10: Mean, median, 25% and 75% quantiles Tajima’s D forempirical and simulated data sets.

pop scenario mean median Q1 Q3
F_COL empirical -0.0714 -0.3536 -0.8912 0.6642
F_COL neut -0.0308 -0.2007 -0.8912 0.7825
F_MIL empirical -0.0932 -0.2712 -1.0037 0.6690
F_MIL neut -0.0501 -0.3066 -0.8360 0.7445
F_DER empirical -0.1072 -0.3634 -1.1470 0.7267
F_DER neut -0.0530 -0.3228 -1.1470 0.7685
F_BUT empirical -0.0955 -0.3066 -0.8452 0.6820
F_BUT neut -0.0299 -0.2824 -0.8912 0.7825
F_FRH empirical -0.0760 -0.2896 -0.8810 0.6669
F_FRH neut -0.0159 -0.2896 -0.8810 0.9185
F_NIM empirical -0.0596 -0.3337 -0.8912 0.6642
F_NIM neut -0.0179 -0.1879 -0.8912 0.7825
F_MKH empirical -0.0589 -0.2896 -0.8848 0.6669
F_MKH neut -0.0381 -0.3125 -0.9130 0.7505
F_STN empirical -0.0809 -0.3071 -0.8602 0.7231
F_STN neut -0.0592 -0.2595 -0.8602 0.7261
F_TOU empirical -0.0683 -0.3125 -0.8848 0.6669
F_TOU neut -0.0197 -0.1879 -0.8912 0.7825
F_MRH empirical -0.0883 -0.3105 -1.1470 0.7267
F_MRH neut -0.0427 -0.2761 -1.1470 0.8711
F_MER empirical -0.0636 -0.3337 -0.8912 0.6642
F_MER neut -0.0231 -0.2108 -0.8939 0.7411
L_USR empirical -0.0465 -0.2897 -0.8452 0.7445
L_USR neut -0.0576 -0.3385 -0.8452 0.6820
W_USR empirical 0.0943 -0.0790 -0.8767 1.0812
W_USR neut -0.0639 -0.3385 -0.9230 0.6820
S_MIL empirical -0.0507 -0.1747 -0.9721 0.7592
S_MIL neut -0.0042 -0.1384 -1.0305 0.8533
S_DER empirical -0.0621 -0.2649 -0.8719 0.6723
S_DER neut -0.0589 -0.2896 -0.8810 0.7969
S_BUT empirical 0.0491 -0.0201 -0.8134 0.8827
S_BUT neut -0.0507 -0.2712 -1.1020 0.8113
S_FRH empirical -0.0874 -0.3414 -1.1552 0.8423
S_FRH neut -0.0478 -0.3414 -1.1552 0.8423

Difference in mean Tajima’s D between empirical and simulateddata sets.

pop empirical neut difference
W_USR 0.0943 -0.0639 0.1582
S_BUT 0.0491 -0.0507 0.0998
L_USR -0.0465 -0.0576 0.0111
S_DER -0.0621 -0.0589 -0.0031
F_MKH -0.0589 -0.0381 -0.0209
F_STN -0.0809 -0.0592 -0.0217
S_FRH -0.0874 -0.0478 -0.0396
F_MER -0.0636 -0.0231 -0.0405
F_COL -0.0714 -0.0308 -0.0406
F_NIM -0.0596 -0.0179 -0.0417
F_MIL -0.0932 -0.0501 -0.0431
F_MRH -0.0883 -0.0427 -0.0456
S_MIL -0.0507 -0.0042 -0.0464
F_TOU -0.0683 -0.0197 -0.0486
F_DER -0.1072 -0.0530 -0.0541
F_FRH -0.0760 -0.0159 -0.0601
F_BUT -0.0955 -0.0299 -0.0655

Difference in mean Tajima’s D between empirical and simulateddata sets.

pop empirical neut difference
W_USR -0.0790 -0.3385 0.2594
S_BUT -0.0201 -0.2712 0.2512
L_USR -0.2897 -0.3385 0.0488
F_MIL -0.2712 -0.3066 0.0354
S_DER -0.2649 -0.2896 0.0247
F_MKH -0.2896 -0.3125 0.0229
F_FRH -0.2896 -0.2896 0.0000
S_FRH -0.3414 -0.3414 0.0000
F_BUT -0.3066 -0.2824 -0.0242
F_MRH -0.3105 -0.2761 -0.0344
S_MIL -0.1747 -0.1384 -0.0364
F_DER -0.3634 -0.3228 -0.0406
F_STN -0.3071 -0.2595 -0.0476
F_MER -0.3337 -0.2108 -0.1229
F_TOU -0.3125 -0.1879 -0.1246
F_NIM -0.3337 -0.1879 -0.1459
F_COL -0.3536 -0.2007 -0.1529

Typically the mean is lower than the median but we are more interested in genome-wide patterns so the median is more informative.

The median lower in the empirical data sets compared to the simulated data set under mutation drift equilibrium in the following populations.

  • F_COL (hatchery)
  • F_MIL
  • F_BUT
  • F_NIM (hatchery)
  • F_MRH (merced river hatchery)
  • F_MER (merced river)

Very small to no difference in the medians for the following populations:

  • F_DER
  • F_FRH (hatchery)
  • S_MIL

The mean and median are higher in the empirical data sets compared to the simulated data set under mutation drift equilibrium in the following populations

  • F_MKH (hatchery)
  • F_STN
  • F_TOU
  • L_USR
  • W_USR (positive mean value; biggest difference in medians)
  • S_DER
  • S_BUT (positive mean value; second biggest difference in medians)
  • S_FRH (median essentially the same)

3.9 Fixed loci

Fixed loci are loci that are not polymorphic for a given set of individuals, i.e. all individuals are homozygous for the same allele. The global allele frequency/counts is the allele frequency or count for a given locus across all individuals.

Differences in sample size may create bias for this comparison, as sample sizes that are (too) low can result in not all alleles being present in a population being sampled in a sufficiently representative manner, which can lead loci to appear fixed even though they do have rare alleles that were simply not sampled. Sample sizes are quite variable across run types, but for individuals grouped by run type withing tributaries, the sample sizes do become much more comparable.

Comparison of fixed loci within tributaries within runs

Identify number of fixed loci within individuals grouped by tributary and run type.

Number of fixed alleles for individuals of each tributarygrouped by run type (using rarefied allele counts).

GRP n
F_COL 3835
F_MIL 3574
F_DER 4898
F_BUT 4702
F_FRH 4385
F_NIM 4369
F_MKH 3788
F_STN 3609
F_TOU 3655
F_MRH 3504
F_MER 4191
L_USR 6416
W_USR 3853
S_MIL 4910
S_DER 5179
S_BUT 6210
S_FRH 4423

Identify loci that are fixed in more than on set of individuals grouped by tributary and run type: The set size (horizontal green bars) indicates the total number of loci fixed in a given location, the intersection size (vertical orange bars) indicates the number of loci fixed only in a single location (single blue dot) or in two, three, or four locations (indicated by blue dots connected by line).

Calculate allelic richness across all individuals in a data set.

# calculate global allelic richness
setPop(gen) <- ~OVERALL

# calculate allele counts using rarefaction
dat <- genind2hierfstat(gen)

df <- allelic.richness(dat,
                       diploid = TRUE)

ar <- as.data.frame(df$Ar) %>%
  rownames_to_column("LOCUS") %>%
  select(-3)

colnames(ar) <- c("LOCUS", "OVERALL")

write_delim(ar, "results/rarefied.allelecount", delim = "\t")

Determine the global rarefied allele counts for loci that are fixed in at least one group.

Distribution of global rarefied allele counts for loci fixed across individuals grouped by tributary within runs.

Distribution of global rarefied allele counts for loci fixed across individuals grouped by tributary within runs.

Table S11: Comparison of mean, median, 25th and 75th quantileof global diversity at fixed loci for individuals grouped byrun/tributary.

GRP mean median Q1 Q3
F_COL 1.20 1.13 1.07 1.30
F_MIL 1.19 1.13 1.07 1.29
F_DER 1.30 1.19 1.07 1.46
F_BUT 1.28 1.19 1.07 1.44
F_FRH 1.26 1.14 1.07 1.40
F_NIM 1.25 1.14 1.07 1.39
F_MKH 1.20 1.13 1.07 1.29
F_STN 1.18 1.13 1.07 1.25
F_TOU 1.19 1.13 1.07 1.29
F_MRH 1.19 1.13 1.07 1.29
F_MER 1.24 1.13 1.07 1.36
L_USR 1.46 1.35 1.12 1.79
W_USR 1.31 1.19 1.07 1.48
S_MIL 1.36 1.24 1.07 1.59
S_DER 1.38 1.25 1.07 1.63
S_BUT 1.41 1.30 1.07 1.70
S_FRH 1.28 1.19 1.07 1.43

Compare the distribution of fixed loci across the genome.

Proportion of loci on chromosome of fixed loci per chromosomefor individuals grouped by run type.

CHR total F_COL F_MIL F_DER F_BUT F_FRH F_NIM F_MKH F_STN F_TOU F_MRH F_MER L_USR W_USR S_MIL S_DER S_BUT S_FRH
1 8167 2.8 2.4 3.2 3.3 2.9 3.1 2.9 2.5 2.5 2.6 3.1 4.7 2.9 3.6 3.7 4.2 3.0
2 4869 2.4 2.1 3.1 2.9 2.9 2.6 2.1 2.3 2.2 2.4 2.5 4.1 2.5 3.1 3.2 4.0 3.1
3 5253 2.7 2.5 3.6 3.4 3.2 3.1 2.7 2.6 2.6 2.3 3.0 4.6 2.7 3.7 3.6 4.5 3.0
4 5657 2.4 2.2 3.3 2.9 2.7 2.7 2.4 2.3 2.2 2.2 2.5 3.5 2.4 3.1 3.0 3.7 2.6
5 6268 2.8 2.6 3.6 3.5 3.1 3.3 2.8 2.6 2.6 2.4 3.0 4.4 2.6 3.4 3.7 4.1 3.1
6 6408 2.5 2.4 3.4 3.1 2.9 2.9 2.7 2.2 2.4 2.4 2.8 4.3 2.5 3.2 3.3 3.8 3.0
7 5707 2.5 2.2 3.2 3.0 2.8 2.7 2.3 2.2 2.5 2.1 2.6 4.2 2.4 2.9 3.4 3.7 2.6
8 6666 2.3 2.1 2.9 3.0 2.7 3.0 2.5 2.5 2.3 2.3 2.7 4.3 2.4 3.0 3.4 4.3 3.1
9 6000 2.8 2.7 3.3 3.3 3.2 3.2 2.6 2.7 2.6 2.2 2.8 4.2 2.6 3.3 3.4 4.0 3.1
10 5351 2.5 2.2 3.1 2.9 2.7 2.6 2.1 2.1 2.2 2.1 2.8 3.9 2.4 3.3 3.5 3.8 2.7
11 3095 2.8 2.3 3.1 3.3 3.2 2.8 2.7 2.4 2.5 2.5 2.6 4.9 2.6 3.2 3.4 4.4 3.1
12 5620 2.0 1.8 2.5 2.4 2.2 2.2 1.8 1.7 1.9 1.9 2.1 3.3 2.1 2.6 2.6 3.2 2.3
13 5979 2.3 2.2 3.1 3.4 2.7 2.9 2.6 2.3 2.4 2.2 2.8 4.1 2.6 3.1 3.2 4.2 2.9
14 4284 2.5 2.4 3.5 3.2 2.7 2.6 2.3 2.5 2.7 2.3 2.9 4.7 2.5 3.5 3.5 4.4 2.8
15 3245 1.9 2.3 2.9 3.0 2.9 2.7 1.8 2.0 2.0 2.2 2.3 3.9 2.4 2.9 3.5 3.9 3.0
16 5093 2.4 2.5 3.3 3.0 2.9 2.7 2.5 2.4 2.4 2.4 2.7 4.3 2.8 3.4 3.4 3.9 2.7
17 2070 2.2 2.2 2.7 3.0 2.6 2.3 2.1 2.1 2.1 2.3 2.7 4.2 2.4 2.8 3.0 3.9 2.6
18 2958 2.5 2.2 3.0 2.8 2.7 2.9 2.3 2.6 2.3 2.3 2.8 4.1 2.5 3.2 3.5 4.2 2.6
19 5071 2.5 2.5 3.3 3.2 2.9 2.9 2.6 2.5 2.4 2.4 2.8 4.1 2.7 3.7 3.8 4.8 2.9
20 3797 2.5 2.4 3.5 2.9 2.9 2.8 2.4 2.3 2.6 2.5 2.9 4.6 2.3 3.3 3.6 4.2 3.1
21 2937 2.8 2.9 3.4 2.9 3.5 2.8 2.7 2.8 2.7 2.6 3.0 4.3 2.9 3.5 4.0 4.1 2.9
22 3447 2.5 2.3 3.4 3.0 2.9 3.0 2.5 2.2 2.5 2.3 2.5 4.9 2.5 3.3 3.5 4.1 3.0
23 1756 2.7 2.1 3.2 2.9 2.7 2.5 2.5 2.1 2.2 2.0 2.5 3.8 2.3 3.2 3.2 3.5 2.9
24 2314 2.1 2.0 3.2 3.0 2.7 2.5 2.5 2.2 2.0 2.0 2.5 3.9 2.4 3.1 3.2 4.3 3.0
25 2980 2.4 2.4 3.1 3.0 2.6 3.1 2.2 2.0 2.3 2.2 2.9 4.3 2.3 3.2 3.8 4.3 2.6
26 4047 2.7 2.5 3.5 3.1 3.1 3.1 2.9 2.6 2.3 2.4 2.7 4.8 2.5 3.2 3.6 4.3 3.0
27 1997 2.8 2.3 3.0 2.8 3.0 3.1 2.6 2.5 2.6 2.4 3.2 3.3 2.5 3.2 2.9 3.5 3.1
28 3291 2.2 2.1 3.1 2.9 2.4 2.8 2.3 2.5 2.4 2.1 2.4 4.0 2.6 2.9 3.1 3.4 2.8
29 2272 3.0 2.6 3.5 3.7 3.3 3.2 2.8 2.6 3.0 2.3 2.9 3.9 2.8 3.4 3.5 4.6 3.2
30 3383 3.0 3.0 3.4 3.2 3.5 3.3 3.0 2.7 2.7 2.5 3.6 4.8 2.9 3.4 3.8 4.9 3.2
31 2772 2.3 2.2 2.8 2.8 2.5 2.8 2.4 2.2 1.9 1.8 2.7 3.5 2.3 2.6 2.8 3.7 2.7
32 1274 1.3 1.6 2.0 2.0 2.1 1.6 1.6 1.7 1.6 1.6 1.9 3.1 1.8 2.4 2.2 3.5 2.4
33 3255 2.1 2.1 2.4 2.8 2.4 2.3 2.2 2.0 2.0 1.9 2.2 3.8 1.8 2.7 2.9 3.6 2.4
34 1094 2.0 1.6 2.2 2.2 2.4 2.9 1.9 1.8 2.3 2.2 2.4 2.8 2.3 2.7 2.8 3.0 2.8

Determine the null distribution of the proportion of loci expected to be fixed per chromosome if fixed loci are randomly distributed among chromosomes. Maintain the total number of loci fixed for a run/tributary group and the total number of loci per chromosome in the data by shuffling chromosome designations across loci. Then calculate the proportion of loci fixed on a chromosome for each run/tributary group and determining if the observed value falls is higher than the 5th and lower than the 95th percentile; determine significance as observed value lying outside observed distribution.

obs <- tidy_ar %>%
   mutate(FIXED = ifelse(AR == 1, TRUE, FALSE)) %>%
   mutate(GRP = ordered(GRP, levels = trib_runs)) %>%
   left_join(contigs) %>%
    filter(!is.na(CHROMOSOME)) %>%
    select(GRP, LOCUS, FIXED, CHROMOSOME)

reps <- 1000

sim <- list()

for(i in 1:reps){

    sim[[i]] <- obs %>%
        group_by(GRP) %>%
        mutate(SIM = sample(CHROMOSOME, replace = FALSE)) %>%
        ungroup() %>%
        filter(FIXED == TRUE) %>%
        count(GRP, SIM) %>%
        left_join(n_chrom, by = c("SIM" = "CHROMOSOME")) %>%
        mutate(percent = (n/total)*100)
}

null_dist <- ldply(sim, data.frame) %>%
    group_by(GRP, SIM) %>%
    summarize(min = min(percent),
              max = max(percent)) %>%
    ungroup() %>%
    rename(CHROMOSOME = SIM) %>%
    left_join(pattern) %>%
    select(GRP, CHROMOSOME, total, n, percent, min, max) %>%
    mutate(sign = ifelse(percent > min & percent < max, "expected", "significant"))

Compare observed and simulated distributions.

Table S12: Chromosomes with significantly higher/lower thanexpected (outside simulated null distribution) number of fixed loci foreach run/tributary group.

GRP CHROMOSOME total n percent min max
F_MIL 9 6000 161 2.68 1.62 2.65
F_DER 33 3255 77 2.37 2.58 4.09
F_FRH 9 6000 191 3.18 2.08 3.13
L_USR 19 5071 208 4.10 4.30 5.72
L_USR 24 2314 91 3.93 3.93 6.09
W_USR 33 3255 59 1.81 1.90 3.29
S_FRH 19 5071 146 2.88 2.92 4.24

Table S13: Mean +/- sd, minimum, and maximum percent of lociper chromosome that are fixed across a group of individuals with thesame run type for a given tributary.

GRP mean sd min max
F_COL 2.45 0.35 1.26 3.04
F_MIL 2.30 0.30 1.57 3.02
F_DER 3.12 0.38 2.04 3.59
F_BUT 2.99 0.34 1.96 3.65
F_FRH 2.82 0.32 2.12 3.47
F_NIM 2.79 0.35 1.57 3.31
F_MKH 2.42 0.33 1.65 2.99
F_STN 2.31 0.28 1.73 2.83
F_TOU 2.35 0.29 1.57 2.99
F_MRH 2.24 0.22 1.57 2.58
F_MER 2.70 0.32 1.88 3.58
L_USR 4.10 0.51 2.83 4.91
W_USR 2.46 0.25 1.81 2.90
S_MIL 3.15 0.32 2.43 3.69
S_DER 3.33 0.38 2.20 3.98
S_BUT 4.00 0.43 3.02 4.94
S_FRH 2.86 0.24 2.35 3.21

3.10 Singletons

Singletons are alleles that only occur in a single individual. This could be a locus that is only variable in one individual or a locus that has multiple alleles though one (or more) of those alleles are extremely rare occurring in only one individual.

# group all individuals in single group
setPop(gen) <- ~OVERALL

# calculate basic stats
dat <- genind2hierfstat(gen)
stats <- basic.stats(dat)

# extract & format allele frequencies per locus into single df
f <- stats$pop.freq

freq <- list()

for(l in names(f)){

freq[[l]] <- as.data.frame(f[[l]]) %>%
  filter(Var2 == 1) %>%
  rename(ALLELE = x,
         FRQ = Freq) %>%
  select(ALLELE, FRQ)

}

freq <- ldply(freq, data.frame) %>%
  rename(LOCUS = `.id`)

# identify singletons
Nind <- length(indNames(gen))

min <- round(1/(2*Nind), digits = 4)

singletons <- freq %>%
  filter(FRQ <= min) %>%
  mutate(ALLELE = as.integer(ALLELE))

Total number of alleles across all loci in the data set is 29873, the total number of singletons is 363 (1.2% of loci) exhibit at least one singleton.

Number of loci with x singletons.

singletons loci
1 347
2 8

Compare distribution of loci with singletons across the genome to random null distribution.

Chromosomes with significantly higher/lower than expected(outside simulated null distribution) number of loci with observedsingletons.

CHROMOSOME min max n

Determine the number of singletons individuals in each group carry.

Comparison of number of singletons per individual (individual circles); for individuals grouped by tributary and run type.

Comparison of number of singletons per individual (individual circles); for individuals grouped by tributary and run type.

Table S14: Mean, standard deviation, median, 25th and 27thquantile of the number of singletons observer per individual forindividuals grouped by run/tributary.

RUN_LOC mean sd median Q1 Q3
F_COL 1.55 1.00 1 1.00 2.00
F_MIL 1.38 0.65 1 1.00 2.00
F_DER 1.67 1.12 1 1.00 2.00
F_BUT 1.33 0.82 1 1.00 1.00
F_FRH 1.71 0.92 1 1.00 2.00
F_NIM 2.21 1.67 1 1.00 3.75
F_MKH 2.00 0.78 2 1.25 2.75
F_STN 1.00 0.00 1 1.00 1.00
F_TOU 1.36 0.63 1 1.00 1.75
F_MRH 1.71 0.76 2 1.00 2.00
F_MER 1.47 0.80 1 1.00 2.00
L_USR 2.23 1.54 2 1.00 2.00
W_USR 1.78 1.39 1 1.00 2.00
S_MIL 1.82 1.40 1 1.00 2.00
S_DER 3.00 4.64 2 1.00 2.75
S_BUT 4.00 3.46 3 2.50 4.50
S_FRH 1.25 0.50 1 1.00 1.25

3.11 Loci variable in only one group of individuals (“private polymorphisms”)

Identify loci that are polymorphic in one set of individuals (i.e. \(N(alleles) > 1\)) but that locus is not variable in any other set of individuals (i.e. fixed across all other sets of loci). The polymorphisms must occur in at least one individual of a group (i.e. singletons) but could be variable in multiple individuals, the allele other groups are fixed for may also occur in the group it is variable at, i.e. these are not necessarily private alleles, rather they are loci exclusively polymorphic in a single group.

Identify loci only variable among individuals from the same tributary by run type (private polymorphisms)

tidy_ar <- read_delim("results/trib_run_rarefied.allelecount", delim = "\t") %>%
  gather(key = GRP, AR, 2:18) %>%
  mutate(GRP = ordered(GRP, levels = trib_runs)) %>%
  rename(N_ALLELES = AR)

pops <- unique(tidy_ar$GRP)

# df for results
results <- setNames(data.frame(matrix(ncol = 2, nrow = 0)), 
                    c("LOCUS", "POLYMORPHIC")) %>%
  mutate(LOCUS = as.character(LOCUS),
         POLYMORPHIC = as.character(POLYMORPHIC))

# loop over to identify loci polymorphic in one pop but fixed in all others
for(p in pops){

# find loci polymorphic in pop
polymorphic <- tidy_ar %>%
  filter(N_ALLELES > 1 & GRP == p)

# find loci fixed in all other pops
fixed <- tidy_ar %>%
  filter(!GRP == p) %>%
  group_by(LOCUS) %>%
  count(N_ALLELES == 1) %>%
  filter(`N_ALLELES == 1` == TRUE) %>%
  filter(n == length(pops)-1) %>%
  mutate(POLYMORPHIC = p) %>%
  select(LOCUS, POLYMORPHIC)

results <- bind_rows(results, fixed)

}

pp <- results %>%
    count(LOCUS) %>%
    filter(n == 1)

results <- results %>%
    filter(LOCUS %in% pp$LOCUS) 

Determine differences in number of loci with private polymorphisms among run/tributary groups.

Number of loci polymorphic in individuals of the same run typewithin a tributary that are not variable in individuals of any othergroup based on rarefied allele counts.

POLYMORPHIC n
W_USR 153
S_DER 97
F_MRH 89
S_FRH 84
F_MIL 83
F_TOU 82
L_USR 70
S_MIL 70
F_COL 57
F_STN 55
F_MKH 54
F_DER 49
F_FRH 47
F_MER 46
F_BUT 44
F_NIM 33
S_BUT 18

Compare the distribution of exclusively polymorphic loci across the genome.

Determine the percent of loci on a chromosome that are only polymorphic for a set of individuals from the same tributary of the same run type.

Percent loci on a chromosome that are only polymorphic for aset of individuals from the same tributary of the same runtype.

CHROMOSOME total F_COL F_MIL F_DER F_BUT F_FRH F_NIM F_MKH F_STN F_TOU F_MRH F_MER L_USR W_USR S_MIL S_DER S_BUT S_FRH
1 8167 0.09 0.10 0.04 0.04 0.04 0.04 0.02 0.04 0.07 0.10 0.04 0.05 0.07 0.07 0.11 0.04 0.02
2 4869 0.02 0.04 NA 0.02 0.04 0.02 0.04 0.02 0.08 0.06 NA NA 0.12 NA 0.02 NA 0.02
3 5253 0.04 0.04 0.02 0.04 0.02 NA 0.02 0.04 0.02 0.15 0.06 0.04 0.13 0.06 0.06 NA 0.06
4 5657 0.02 0.02 0.05 0.04 0.02 0.02 0.07 0.11 0.04 0.05 0.04 0.04 0.12 0.02 0.05 NA 0.05
5 6268 0.03 0.10 0.02 0.03 NA 0.03 NA 0.02 0.08 0.11 0.10 0.05 0.16 NA 0.11 0.02 0.10
6 6408 0.05 0.06 0.02 0.02 0.03 0.05 0.02 0.05 0.06 0.02 0.02 0.05 0.09 0.08 0.08 0.02 0.05
7 5707 0.05 0.07 0.02 NA 0.07 NA 0.02 0.02 0.04 0.12 0.07 0.04 0.18 0.07 0.07 0.04 0.07
8 6666 0.03 0.02 0.02 0.03 0.05 0.05 0.03 0.02 0.05 0.05 NA 0.08 0.02 0.05 0.05 0.03 NA
9 6000 NA 0.10 0.03 0.03 0.03 0.03 0.05 0.03 0.05 0.12 0.07 0.02 0.13 0.07 0.05 NA 0.05
10 5351 0.02 0.06 0.04 0.04 NA 0.04 0.06 NA 0.02 0.02 0.02 0.09 0.11 NA 0.07 NA 0.09
11 3095 0.03 0.06 0.10 NA 0.03 0.03 0.10 NA 0.06 0.03 NA NA NA 0.03 0.06 0.03 0.03
12 5620 0.02 0.07 0.02 0.04 0.02 NA 0.04 0.05 0.05 0.05 0.02 0.02 0.07 0.05 0.11 NA 0.04
13 5979 0.05 0.05 NA NA 0.02 NA 0.03 0.03 0.02 0.08 0.03 0.03 0.07 0.05 0.03 0.02 0.07
14 4284 0.12 0.09 0.02 NA 0.05 0.02 0.02 0.02 0.05 0.09 0.02 0.05 0.07 0.05 0.02 NA 0.05
15 3245 0.03 NA 0.03 0.03 NA 0.03 0.06 0.06 0.06 NA 0.06 NA 0.12 NA 0.03 0.03 0.03
16 5093 0.04 0.02 0.04 0.06 0.06 NA 0.08 NA 0.14 0.02 0.02 0.02 0.08 NA 0.04 0.02 0.06
17 2070 NA NA NA NA 0.05 NA 0.10 0.05 0.10 0.10 NA NA 0.14 NA 0.10 0.05 NA
18 2958 NA 0.03 0.14 0.10 NA NA NA NA NA 0.03 0.07 0.03 0.14 0.07 0.10 NA 0.03
19 5071 0.08 NA 0.06 0.04 0.12 0.04 NA 0.04 0.12 0.06 NA 0.04 0.04 0.02 0.04 NA 0.04
20 3797 NA 0.03 0.05 NA NA NA 0.05 0.03 NA 0.03 NA 0.03 0.16 0.05 0.08 NA 0.08
21 2937 0.03 0.03 0.03 0.03 0.07 NA 0.03 0.03 0.07 0.03 0.03 0.17 0.03 0.03 NA NA 0.10
22 3447 NA 0.03 NA 0.03 0.03 NA 0.03 0.03 NA 0.09 0.03 0.12 0.03 0.09 0.03 NA 0.09
23 1756 0.06 0.06 0.06 NA NA NA NA NA NA 0.06 NA 0.17 0.17 NA NA NA 0.11
24 2314 NA 0.04 NA 0.04 0.04 NA 0.04 0.09 0.13 0.04 0.04 NA 0.04 NA 0.09 NA 0.09
25 2980 0.10 NA NA NA 0.03 NA 0.03 0.10 NA NA NA 0.03 0.13 0.07 0.07 NA 0.07
26 4047 0.02 0.05 0.07 0.02 0.05 NA 0.05 NA 0.10 0.05 0.02 0.05 0.27 0.05 0.07 NA 0.02
27 1997 0.10 0.15 0.10 NA 0.05 NA 0.05 0.05 0.05 NA NA 0.05 0.10 0.05 0.05 0.05 0.05
28 3291 NA 0.09 0.06 0.03 0.06 NA 0.03 NA NA 0.15 0.06 NA 0.03 0.06 0.12 NA 0.03
29 2272 0.04 0.04 0.04 0.04 NA NA NA 0.13 NA NA 0.09 NA 0.09 0.04 0.09 NA 0.04
30 3383 0.03 0.06 NA 0.15 NA NA 0.03 0.09 0.03 0.03 NA 0.06 0.06 0.03 0.09 0.03 0.06
31 2772 0.07 0.14 0.04 0.04 0.04 0.14 NA 0.07 NA NA 0.04 0.04 0.04 0.07 NA NA NA
32 1274 0.08 NA NA NA NA 0.08 NA NA NA 0.08 NA 0.08 0.08 NA 0.08 NA NA
33 3255 NA NA NA NA NA 0.03 0.03 NA 0.03 NA 0.06 0.06 0.12 0.06 0.12 NA 0.06
34 1094 0.09 0.18 NA 0.09 0.09 NA NA NA 0.09 NA NA 0.18 0.09 NA NA NA NA

Generate null distribution to determine if some groups exhibit more/less than the expected number of loci with exclusive polymorphisms for sets of individuals.

Chromosomes with significantly higher/lower than expected(outside simulated null distribution) number of private polymorhisms foreach run/tributary group

GRP CHROMOSOME total n percent min max sign
F_COL 2 4869 1 0.0205 0.0205 0.1438 lower
F_COL 4 5657 1 0.0177 0.0177 0.1414 lower
F_COL 10 5351 1 0.0187 0.0187 0.1308 lower
F_COL 11 3095 1 0.0323 0.0323 0.1939 lower
F_COL 12 5620 1 0.0178 0.0178 0.1246 lower
F_COL 15 3245 1 0.0308 0.0308 0.1849 lower
F_COL 21 2937 1 0.0340 0.0340 0.2043 lower
F_COL 23 1756 1 0.0569 0.0569 0.2847 lower
F_COL 26 4047 1 0.0247 0.0247 0.1483 lower
F_COL 29 2272 1 0.0440 0.0440 0.2201 lower
F_COL 30 3383 1 0.0296 0.0296 0.2069 lower
F_COL 32 1274 1 0.0785 0.0785 0.2355 lower
F_COL 34 1094 1 0.0914 0.0914 0.3656 lower
F_MIL 4 5657 1 0.0177 0.0177 0.1768 lower
F_MIL 8 6666 1 0.0150 0.0150 0.1650 lower
F_MIL 16 5093 1 0.0196 0.0196 0.1963 lower
F_MIL 18 2958 1 0.0338 0.0338 0.2028 lower
F_MIL 20 3797 1 0.0263 0.0263 0.2107 lower
F_MIL 21 2937 1 0.0340 0.0340 0.2383 lower
F_MIL 22 3447 1 0.0290 0.0290 0.2611 lower
F_MIL 23 1756 1 0.0569 0.0569 0.3417 lower
F_MIL 24 2314 1 0.0432 0.0432 0.2593 lower
F_MIL 29 2272 1 0.0440 0.0440 0.2641 lower
F_DER 3 5253 1 0.0190 0.0190 0.1333 lower
F_DER 5 6268 1 0.0160 0.0160 0.1117 lower
F_DER 6 6408 1 0.0156 0.0156 0.1248 lower
F_DER 7 5707 1 0.0175 0.0175 0.1227 lower
F_DER 8 6666 1 0.0150 0.0150 0.1350 lower
F_DER 12 5620 1 0.0178 0.0178 0.1423 lower
F_DER 14 4284 1 0.0233 0.0233 0.1867 lower
F_DER 15 3245 1 0.0308 0.0308 0.1849 lower
F_DER 21 2937 1 0.0340 0.0340 0.1702 lower
F_DER 23 1756 1 0.0569 0.0569 0.2847 lower
F_DER 29 2272 1 0.0440 0.0440 0.1761 lower
F_DER 31 2772 1 0.0361 0.0361 0.1804 lower
F_BUT 2 4869 1 0.0205 0.0205 0.1232 lower
F_BUT 6 6408 1 0.0156 0.0156 0.1092 lower
F_BUT 15 3245 1 0.0308 0.0308 0.1541 lower
F_BUT 21 2937 1 0.0340 0.0340 0.1702 lower
F_BUT 22 3447 1 0.0290 0.0290 0.1451 lower
F_BUT 24 2314 1 0.0432 0.0432 0.2161 lower
F_BUT 26 4047 1 0.0247 0.0247 0.1235 lower
F_BUT 28 3291 1 0.0304 0.0304 0.1519 lower
F_BUT 29 2272 1 0.0440 0.0440 0.2201 lower
F_BUT 30 3383 5 0.1478 0.0296 0.1478 higher
F_BUT 31 2772 1 0.0361 0.0361 0.1804 lower
F_BUT 34 1094 1 0.0914 0.0914 0.3656 lower
F_FRH 3 5253 1 0.0190 0.0190 0.1333 lower
F_FRH 4 5657 1 0.0177 0.0177 0.1237 lower
F_FRH 11 3095 1 0.0323 0.0323 0.1939 lower
F_FRH 12 5620 1 0.0178 0.0178 0.1068 lower
F_FRH 13 5979 1 0.0167 0.0167 0.1338 lower
F_FRH 17 2070 1 0.0483 0.0483 0.2899 lower
F_FRH 22 3447 1 0.0290 0.0290 0.1451 lower
F_FRH 24 2314 1 0.0432 0.0432 0.2593 lower
F_FRH 25 2980 1 0.0336 0.0336 0.1678 lower
F_FRH 27 1997 1 0.0501 0.0501 0.2504 lower
F_FRH 31 2772 1 0.0361 0.0361 0.1443 lower
F_FRH 34 1094 1 0.0914 0.0914 0.2742 lower
F_NIM 2 4869 1 0.0205 0.0205 0.1027 lower
F_NIM 4 5657 1 0.0177 0.0177 0.0884 lower
F_NIM 11 3095 1 0.0323 0.0323 0.1616 lower
F_NIM 14 4284 1 0.0233 0.0233 0.1167 lower
F_NIM 15 3245 1 0.0308 0.0308 0.1541 lower
F_NIM 31 2772 4 0.1443 0.0361 0.1443 higher
F_NIM 32 1274 1 0.0785 0.0785 0.1570 lower
F_NIM 33 3255 1 0.0307 0.0307 0.1229 lower
F_MKH 3 5253 1 0.0190 0.0190 0.1333 lower
F_MKH 6 6408 1 0.0156 0.0156 0.1248 lower
F_MKH 7 5707 1 0.0175 0.0175 0.1051 lower
F_MKH 14 4284 1 0.0233 0.0233 0.1401 lower
F_MKH 21 2937 1 0.0340 0.0340 0.1702 lower
F_MKH 22 3447 1 0.0290 0.0290 0.1451 lower
F_MKH 24 2314 1 0.0432 0.0432 0.2161 lower
F_MKH 25 2980 1 0.0336 0.0336 0.1678 lower
F_MKH 27 1997 1 0.0501 0.0501 0.2003 lower
F_MKH 28 3291 1 0.0304 0.0304 0.1519 lower
F_MKH 30 3383 1 0.0296 0.0296 0.2069 lower
F_MKH 33 3255 1 0.0307 0.0307 0.1536 lower
F_STN 2 4869 1 0.0205 0.0205 0.1232 lower
F_STN 5 6268 1 0.0160 0.0160 0.1117 lower
F_STN 7 5707 1 0.0175 0.0175 0.1402 lower
F_STN 8 6666 1 0.0150 0.0150 0.1200 lower
F_STN 14 4284 1 0.0233 0.0233 0.1401 lower
F_STN 17 2070 1 0.0483 0.0483 0.2415 lower
F_STN 20 3797 1 0.0263 0.0263 0.1580 lower
F_STN 21 2937 1 0.0340 0.0340 0.2043 lower
F_STN 22 3447 1 0.0290 0.0290 0.1741 lower
F_STN 27 1997 1 0.0501 0.0501 0.2504 lower
F_TOU 3 5253 1 0.0190 0.0190 0.1904 lower
F_TOU 10 5351 1 0.0187 0.0187 0.1495 lower
F_TOU 13 5979 1 0.0167 0.0167 0.1505 lower
F_TOU 27 1997 1 0.0501 0.0501 0.2504 lower
F_TOU 30 3383 1 0.0296 0.0296 0.2069 lower
F_TOU 33 3255 1 0.0307 0.0307 0.2151 lower
F_TOU 34 1094 1 0.0914 0.0914 0.4570 lower
F_MRH 6 6408 1 0.0156 0.0156 0.1717 lower
F_MRH 10 5351 1 0.0187 0.0187 0.2056 lower
F_MRH 11 3095 1 0.0323 0.0323 0.2908 lower
F_MRH 16 5093 1 0.0196 0.0196 0.1963 lower
F_MRH 18 2958 1 0.0338 0.0338 0.2705 lower
F_MRH 20 3797 1 0.0263 0.0263 0.2107 lower
F_MRH 21 2937 1 0.0340 0.0340 0.2724 lower
F_MRH 23 1756 1 0.0569 0.0569 0.2847 lower
F_MRH 24 2314 1 0.0432 0.0432 0.3025 lower
F_MRH 30 3383 1 0.0296 0.0296 0.2069 lower
F_MRH 32 1274 1 0.0785 0.0785 0.3925 lower
F_MER 6 6408 1 0.0156 0.0156 0.1092 lower
F_MER 10 5351 1 0.0187 0.0187 0.1308 lower
F_MER 12 5620 1 0.0178 0.0178 0.1068 lower
F_MER 14 4284 1 0.0233 0.0233 0.1167 lower
F_MER 16 5093 1 0.0196 0.0196 0.1571 lower
F_MER 21 2937 1 0.0340 0.0340 0.1702 lower
F_MER 22 3447 1 0.0290 0.0290 0.1741 lower
F_MER 24 2314 1 0.0432 0.0432 0.1729 lower
F_MER 26 4047 1 0.0247 0.0247 0.1730 lower
F_MER 31 2772 1 0.0361 0.0361 0.1443 lower
L_USR 9 6000 1 0.0167 0.0167 0.1333 lower
L_USR 12 5620 1 0.0178 0.0178 0.1246 lower
L_USR 16 5093 1 0.0196 0.0196 0.1571 lower
L_USR 18 2958 1 0.0338 0.0338 0.2366 lower
L_USR 20 3797 1 0.0263 0.0263 0.2634 lower
L_USR 25 2980 1 0.0336 0.0336 0.2349 lower
L_USR 27 1997 1 0.0501 0.0501 0.2504 lower
L_USR 31 2772 1 0.0361 0.0361 0.2165 lower
L_USR 32 1274 1 0.0785 0.0785 0.3925 lower
W_USR 8 6666 1 0.0150 0.0150 0.2250 lower
W_USR 21 2937 1 0.0340 0.0340 0.3064 lower
W_USR 22 3447 1 0.0290 0.0290 0.2901 lower
W_USR 24 2314 1 0.0432 0.0432 0.3889 lower
W_USR 26 4047 11 0.2718 0.0247 0.2471 higher
W_USR 28 3291 1 0.0304 0.0304 0.3039 lower
W_USR 31 2772 1 0.0361 0.0361 0.3247 lower
W_USR 32 1274 1 0.0785 0.0785 0.4710 lower
W_USR 34 1094 1 0.0914 0.0914 0.6399 lower
S_MIL 4 5657 1 0.0177 0.0177 0.1237 lower
S_MIL 11 3095 1 0.0323 0.0323 0.1939 lower
S_MIL 19 5071 1 0.0197 0.0197 0.1775 lower
S_MIL 21 2937 1 0.0340 0.0340 0.2043 lower
S_MIL 27 1997 1 0.0501 0.0501 0.2504 lower
S_MIL 29 2272 1 0.0440 0.0440 0.2201 lower
S_MIL 30 3383 1 0.0296 0.0296 0.1774 lower
S_DER 2 4869 1 0.0205 0.0205 0.2054 lower
S_DER 14 4284 1 0.0233 0.0233 0.2568 lower
S_DER 15 3245 1 0.0308 0.0308 0.2465 lower
S_DER 22 3447 1 0.0290 0.0290 0.2031 lower
S_DER 27 1997 1 0.0501 0.0501 0.2504 lower
S_DER 32 1274 1 0.0785 0.0785 0.3925 lower
S_BUT 5 6268 1 0.0160 0.0160 0.0638 lower
S_BUT 6 6408 1 0.0156 0.0156 0.0624 lower
S_BUT 11 3095 1 0.0323 0.0323 0.0969 lower
S_BUT 13 5979 1 0.0167 0.0167 0.0669 lower
S_BUT 15 3245 1 0.0308 0.0308 0.0924 lower
S_BUT 16 5093 1 0.0196 0.0196 0.0785 lower
S_BUT 17 2070 1 0.0483 0.0483 0.1449 lower
S_BUT 27 1997 1 0.0501 0.0501 0.1502 lower
S_BUT 30 3383 1 0.0296 0.0296 0.0887 lower
S_FRH 2 4869 1 0.0205 0.0205 0.1643 lower
S_FRH 11 3095 1 0.0323 0.0323 0.2585 lower
S_FRH 15 3245 1 0.0308 0.0308 0.2157 lower
S_FRH 18 2958 1 0.0338 0.0338 0.2028 lower
S_FRH 26 4047 1 0.0247 0.0247 0.1730 lower
S_FRH 27 1997 1 0.0501 0.0501 0.2504 lower
S_FRH 28 3291 1 0.0304 0.0304 0.2127 lower
S_FRH 29 2272 1 0.0440 0.0440 0.2641 lower

Table S15: Comparison across run/tributary group of the mean+/- sd, minimum, and maximum percent of loci on a chromosome that areonly polymorphic for a set of individuals from the same tributary of thesame run type.

POLYMORPHIC mean sd min max
F_COL 0.05 0.03 0.02 0.12
F_MIL 0.07 0.04 0.02 0.18
F_DER 0.05 0.03 0.02 0.14
F_BUT 0.04 0.03 0.02 0.15
F_FRH 0.05 0.02 0.02 0.12
F_NIM 0.04 0.03 0.02 0.14
F_MKH 0.04 0.02 0.02 0.10
F_STN 0.05 0.03 0.02 0.13
F_TOU 0.06 0.03 0.02 0.14
F_MRH 0.07 0.04 0.02 0.15
F_MER 0.05 0.02 0.02 0.10
L_USR 0.06 0.05 0.02 0.18
W_USR 0.10 0.05 0.02 0.27
S_MIL 0.05 0.02 0.02 0.09
S_DER 0.07 0.03 0.02 0.12
S_BUT 0.03 0.01 0.02 0.05
S_FRH 0.06 0.03 0.02 0.11

Table S16: Number of chromosomes with significantly more/lessloci with private polymorphisms than expected.

GRP sign n
F_COL lower 13
F_DER lower 12
F_FRH lower 12
F_MKH lower 12
F_BUT lower 11
F_MRH lower 11
F_MIL lower 10
F_STN lower 10
F_MER lower 10
L_USR lower 9
S_BUT lower 9
W_USR lower 8
S_FRH lower 8
F_NIM lower 7
F_TOU lower 7
S_MIL lower 7
S_DER lower 6
F_BUT higher 1
F_NIM higher 1
W_USR higher 1

Table S17: Number of run/trib groups a chromosome has beenflagged as having significantly more/less loci with privatepolymorphisms than expected.

CHROMOSOME sign n
21 lower 10
27 lower 9
22 lower 8
11 lower 7
15 lower 7
24 lower 7
2 lower 6
6 lower 6
14 lower 6
29 lower 6
30 lower 6
31 lower 6
32 lower 6
4 lower 5
12 lower 5
16 lower 5
34 lower 5
3 lower 4
8 lower 4
10 lower 4
18 lower 4
20 lower 4
23 lower 4
26 lower 4
28 lower 4
5 lower 3
7 lower 3
13 lower 3
17 lower 3
25 lower 3
33 lower 3
9 lower 1
19 lower 1
26 higher 1
30 higher 1
31 higher 1

Identify loci only variable among individuals from the same run type (private polymorphisms)

For this analysis we will exclude hatchery individuals and combine wild individuals from each run and the subset to 21 individuals (number of individuals in Late Fall run data set) to identify the number of alleles that occur only in individuals of a given run.

# wild populations
wild <- strata %>%
  filter(SOURCE == "WILD")

# subset genind object
gen_wild <- gen[row.names(gen@tab) %in% wild$LIB_ID]

# by tributary within run
setPop(gen_wild) <- ~RUN

# calculate allele counts using rarefaction ----
dat <- genind2hierfstat(gen_wild)

df <- allelic.richness(dat,
                       diploid = TRUE)

df <- as.data.frame(df$Ar) %>%
  rownames_to_column("LOCUS")

write_delim(df, "results/run_rarefied.allelecount", delim = "\t")

# identify private polymorphisms ----
tidy_ar <-df %>%
  gather(key = GRP, AR, 2:5) %>%
  mutate(GRP = ordered(GRP, levels = run_levels)) %>%
  rename(N_ALLELES = AR)

pops <- unique(tidy_ar$GRP)

# df for results
results <- setNames(data.frame(matrix(ncol = 2, nrow = 0)), 
                    c("LOCUS", "POLYMORPHIC")) %>%
  mutate(LOCUS = as.character(LOCUS),
         POLYMORPHIC = as.character(POLYMORPHIC))

# loop over to identify loci polymorphic in one pop but fixed in all others
for(p in pops){

# find loci polymorphic in pop
polymorphic <- tidy_ar %>%
  filter(N_ALLELES > 1 & GRP == p)

# find loci fixed in all other pops
fixed <- tidy_ar %>%
  filter(!GRP == p) %>%
  group_by(LOCUS) %>%
  count(N_ALLELES == 1) %>%
  filter(`N_ALLELES == 1` == TRUE) %>%
  filter(n == length(pops)-1) %>%
  mutate(POLYMORPHIC = p) %>%
  select(LOCUS, POLYMORPHIC)

results <- bind_rows(results, fixed)

}

pp <- results %>%
    count(LOCUS) %>%
    filter(n == 1)

results <- results %>%
    filter(LOCUS %in% pp$LOCUS) 

Determine differences in number of loci with private polymorphisms among run/tributary groups.

Number of loci polymorphic in individuals of the same run typethat are not variable in individuals of any other run based on rarefiedallele counts.

population n
Fall 1348
Late-Fall 122
Spring 580
Winter 95

3.12 Private alleles

Private alleles are alleles that only occur within a given set of individuals; this can be a locus that is only variable in one set of individuals (all other individuals are fixed for the alternate allele(s)) or it could be a locus with multiple alleles, that are variable in other sets of individuals (i.e. heterozygotes & homozygotes occur) but one (or more) of those alleles only exists within a given set of individuals. Not all individuals of the group will have the private allele, i.e. they may be more or less common.

Comparison of private alleles by tributary within run type

There is a relationship between sample size and the number of private alleles found making it different to compare levels overall (many private alleles are rare so for smaller sample sizes they might not be recovered); though it is important to note that all run/tributary groups do exhibit private alleles, i.e. harbor unique genetic material.

Comparison of sample size and number of private alleles detected in each trib/run population.

Comparison of sample size and number of private alleles detected in each trib/run population.

To account for this, we subsampled down to 15 individuals 100 times to determine the number of private alleles occurring in each subset of individuals.

# unset the seed to get individually different reps
set.seed(NULL)

# list for results
priv_allele_counts <- list()

# number of individuals to subset
n <- 20

# feather river spring individuals

for(i in 1:100){

  # draw up to n individuals from each population
  sample_ind <- strata %>%
    group_by(RUN_LOC) %>%
    slice_sample(n = n)
  
  # subset genind object
  gen_subset <- gen[row.names(gen@tab) %in% sample_ind$LIB_ID]
  
  # identify private alleles
  alleles_private <- poppr::private_alleles(gen_subset,
                                            alleles ~ RUN_LOC,
                                            report = "data.frame")
  # count no private alleles per population
  priv_allele_counts[[i]] <- alleles_private %>%
    filter(count > 0) %>%
    count(population) %>%
    mutate(rep = i)
  
}

results <- ldply(priv_allele_counts, data.frame)

write_delim(results, "results/priv-alleles_15ind-100reps.txt",
            delim = "\t")

Calculate the mean number of private alleles across all bootstraps.

Compariso of sample size and number of private alleles detected in each trib/run population after adjusting for sampling size by bootstrapping to 15 individuals.

Compariso of sample size and number of private alleles detected in each trib/run population after adjusting for sampling size by bootstrapping to 15 individuals.

Number of private alleles occurring in each set of individualsgrouped by run type within tributaries. All populations except for S_FRHwere subsampled down to 15 individuals.

population mean sd
F_COL 140.21 9.706079
F_MIL 130.80 6.843119
F_DER 140.63 6.539167
F_BUT 90.48 7.576519
F_FRH 139.50 13.858958
F_NIM 123.39 11.653148
F_MKH 115.67 11.751686
F_STN 104.04 7.961930
F_TOU 94.75 8.867572
F_MRH 132.01 6.330135
F_MER 93.19 8.604315
L_USR 134.58 9.584478
W_USR 122.99 9.408244
S_MIL 193.20 10.078480
S_DER 203.80 38.640273
S_BUT 185.43 24.757513
S_FRH 59.46 4.875045

Populations with high standard deviations are either hatchery individuals or have large number of private polymorphisms (which set of individuals that you pull out can have a larger impact on number of private alleles depending on whether you happen to pull a set with a large number of private polymorphisms or not).

Compare the distribution of loci with private alleles across the genome.

Determine the % loci of a given chromosome that are only polymorphic for a set of individuals from the same tributary of the same run type.

CHROMOSOME total F_COL F_MIL F_DER F_BUT F_FRH F_NIM F_MKH F_STN F_TOU F_MRH F_MER L_USR W_USR S_MIL S_DER S_BUT S_FRH
1 8167 0.12 0.09 0.09 0.05 0.16 0.23 0.12 0.06 0.02 0.06 0.07 0.04 0.07 0.10 0.17 0.17 0.04
2 4869 0.08 0.06 0.02 0.04 0.10 0.12 0.06 NA 0.08 0.06 0.06 0.02 NA 0.02 0.14 0.12 NA
3 5253 0.23 0.08 0.04 NA 0.08 0.10 0.11 0.06 0.08 0.04 0.08 0.08 0.06 0.13 0.19 0.17 NA
4 5657 0.11 0.05 0.05 0.05 0.04 0.09 0.02 0.05 0.09 0.09 0.14 0.14 0.16 0.05 0.19 0.09 0.02
5 6268 0.22 0.02 0.05 0.03 0.11 0.11 0.05 0.11 0.02 0.02 0.03 0.10 0.16 0.06 0.18 0.14 0.02
6 6408 0.05 0.05 0.06 0.06 0.11 0.14 0.16 0.05 0.03 0.02 0.09 0.06 0.08 0.11 0.17 0.14 0.03
7 5707 0.14 0.09 0.02 NA 0.07 0.07 0.07 0.09 0.04 0.04 0.04 0.09 0.11 0.09 0.25 0.11 0.07
8 6666 0.09 0.09 0.05 0.08 0.09 0.11 0.18 0.11 0.09 0.03 0.06 0.03 0.14 0.12 0.11 0.11 0.06
9 6000 0.23 0.07 0.05 0.08 0.15 0.05 0.05 0.12 0.07 0.03 0.08 0.07 0.03 0.13 0.17 0.08 NA
10 5351 0.09 0.04 0.04 0.07 0.09 0.06 0.11 0.02 0.06 0.07 0.02 0.15 0.13 0.06 0.19 0.09 0.04
11 3095 0.03 0.03 0.03 0.13 0.06 0.13 0.13 0.06 0.16 0.13 0.03 0.03 0.03 0.03 0.13 0.10 0.10
12 5620 0.12 0.04 0.07 0.02 0.12 0.12 0.07 0.05 0.07 0.07 0.09 0.04 0.11 0.09 0.14 0.14 0.02
13 5979 0.13 0.03 NA NA 0.12 0.05 0.07 0.07 0.03 NA 0.08 0.08 0.03 0.12 0.12 0.05 0.02
14 4284 0.16 0.09 NA 0.05 0.14 0.07 0.16 0.05 0.09 0.05 0.05 0.07 0.07 0.05 0.16 0.05 NA
15 3245 0.06 NA 0.03 0.03 0.06 0.06 0.18 0.09 0.12 0.06 0.06 0.06 NA NA 0.25 0.06 0.06
16 5093 0.12 0.08 0.14 NA 0.06 0.18 0.08 0.02 0.16 0.04 0.04 0.10 0.04 NA 0.12 0.08 0.02
17 2070 0.10 0.05 NA NA NA 0.14 NA NA 0.10 NA 0.05 0.05 NA 0.14 0.14 0.14 0.05
18 2958 0.10 NA 0.14 0.07 0.14 NA NA 0.07 0.07 0.17 0.03 0.14 0.07 0.14 0.14 0.20 NA
19 5071 0.10 0.20 0.08 0.08 0.10 0.20 0.12 0.04 0.02 0.14 0.10 0.08 0.12 0.04 0.16 0.06 NA
20 3797 0.08 0.08 0.05 0.03 0.03 NA 0.03 0.03 0.11 0.13 0.08 0.13 0.05 0.05 0.26 0.08 NA
21 2937 0.10 0.07 0.07 0.03 0.10 0.10 0.03 0.10 0.10 0.07 0.14 0.10 0.17 0.10 0.07 0.03 NA
22 3447 0.15 0.06 0.03 0.03 0.12 0.03 0.03 0.06 0.06 0.03 0.03 0.15 0.17 0.12 0.15 0.09 NA
23 1756 0.06 NA 0.06 NA 0.17 0.23 0.11 0.11 NA 0.11 0.06 0.11 0.23 NA 0.23 0.06 NA
24 2314 0.26 0.04 0.13 0.04 0.09 0.17 0.09 0.09 0.09 NA 0.13 0.13 0.04 0.09 0.09 0.22 NA
25 2980 NA 0.17 0.03 NA 0.07 0.10 0.13 NA 0.10 0.03 0.20 0.13 0.03 0.17 0.23 0.13 NA
26 4047 0.17 0.07 0.10 0.05 0.07 0.15 0.07 0.05 0.07 0.10 0.05 0.05 0.07 0.07 0.35 0.15 NA
27 1997 0.05 0.10 NA NA 0.15 0.05 0.10 0.05 0.10 0.10 0.10 0.15 0.05 0.15 0.15 0.20 0.05
28 3291 0.21 0.12 0.03 NA 0.18 NA 0.03 0.06 0.03 0.06 0.06 0.03 NA 0.12 0.12 0.21 NA
29 2272 NA 0.04 0.04 NA 0.26 0.09 0.04 0.13 0.04 0.13 0.18 0.13 0.04 0.09 0.13 0.09 NA
30 3383 0.15 0.12 0.18 NA 0.06 0.09 0.06 NA 0.09 NA 0.09 0.09 0.06 0.06 0.09 0.12 0.06
31 2772 NA 0.04 0.04 0.18 0.22 0.04 0.22 0.04 NA 0.07 0.11 0.04 0.07 0.11 0.14 0.04 NA
32 1274 0.08 NA NA 0.08 NA NA 0.08 NA 0.08 NA NA NA 0.08 0.31 0.24 0.08 0.08
33 3255 0.06 0.03 NA 0.06 NA 0.09 0.09 0.06 0.06 0.06 0.03 0.09 0.06 0.22 0.40 0.22 0.03
34 1094 NA 0.09 0.09 NA 0.18 0.09 0.18 NA NA NA NA NA 0.18 NA 0.09 0.09 NA

Compare the mean proportion of loci per chromosome that are private polymorphisms across tributaries.

Table S18: Comparison across run/tributary group of the mean+/- sd, minimum, and maximum percent of loci on a chromosome that areonly polymorphic for a set of individuals from the same tributary of thesame run type.

population mean sd min max
F_COL 0.12 0.06 0.03 0.26
F_MIL 0.07 0.04 0.02 0.20
F_DER 0.06 0.04 0.02 0.18
F_BUT 0.06 0.04 0.02 0.18
F_FRH 0.11 0.05 0.03 0.26
F_NIM 0.11 0.05 0.03 0.23
F_MKH 0.10 0.05 0.02 0.22
F_STN 0.07 0.03 0.02 0.13
F_TOU 0.07 0.04 0.02 0.16
F_MRH 0.07 0.04 0.02 0.17
F_MER 0.08 0.04 0.02 0.20
L_USR 0.09 0.04 0.02 0.15
W_USR 0.09 0.05 0.03 0.23
S_MIL 0.10 0.06 0.02 0.31
S_DER 0.17 0.07 0.07 0.40
S_BUT 0.11 0.05 0.03 0.22
S_FRH 0.04 0.02 0.02 0.10

Generate a random null distribution to determine if loci with private alleles are non-randomly distributed across chromosomes.

Table S19: Chromosomes with significantly higher/lower thanexpected (outside simulated null distribution) number of fixed loci foreach run/tributary group

GRP CHROMOSOME total n percent min max sign
F_MIL 34 1094 1 0.0914 0.0914 0.3656 lower
F_DER 34 1094 1 0.0914 0.0914 0.3656 lower
F_NIM 34 1094 1 0.0914 0.0914 0.4570 lower
S_DER 34 1094 1 0.0914 0.0914 0.4570 lower
S_BUT 34 1094 1 0.0914 0.0914 0.3656 lower
F_MIL 33 3255 1 0.0307 0.0307 0.2458 lower
F_MER 33 3255 1 0.0307 0.0307 0.2458 lower
S_FRH 33 3255 1 0.0307 0.0307 0.1229 lower
F_COL 32 1274 1 0.0785 0.0785 0.4710 lower
F_BUT 32 1274 1 0.0785 0.0785 0.3140 lower
F_MKH 32 1274 1 0.0785 0.0785 0.3140 lower
F_TOU 32 1274 1 0.0785 0.0785 0.3140 lower
W_USR 32 1274 1 0.0785 0.0785 0.3140 lower
S_BUT 32 1274 1 0.0785 0.0785 0.4710 lower
S_FRH 32 1274 1 0.0785 0.0785 0.2355 lower
F_MIL 31 2772 1 0.0361 0.0361 0.2886 lower
F_DER 31 2772 1 0.0361 0.0361 0.2165 lower
F_BUT 31 2772 5 0.1804 0.0361 0.1804 higher
F_NIM 31 2772 1 0.0361 0.0361 0.2886 lower
F_STN 31 2772 1 0.0361 0.0361 0.2165 lower
L_USR 31 2772 1 0.0361 0.0361 0.2886 lower
S_BUT 31 2772 1 0.0361 0.0361 0.3247 lower
F_DER 30 3383 6 0.1774 0.0296 0.1774 higher
F_MIL 29 2272 1 0.0440 0.0440 0.3081 lower
F_DER 29 2272 1 0.0440 0.0440 0.3081 lower
F_MKH 29 2272 1 0.0440 0.0440 0.3961 lower
F_TOU 29 2272 1 0.0440 0.0440 0.2641 lower
W_USR 29 2272 1 0.0440 0.0440 0.3521 lower
F_DER 28 3291 1 0.0304 0.0304 0.2127 lower
F_MKH 28 3291 1 0.0304 0.0304 0.2735 lower
F_TOU 28 3291 1 0.0304 0.0304 0.2127 lower
L_USR 28 3291 1 0.0304 0.0304 0.2127 lower
F_COL 27 1997 1 0.0501 0.0501 0.4507 lower
F_NIM 27 1997 1 0.0501 0.0501 0.5008 lower
F_STN 27 1997 1 0.0501 0.0501 0.3005 lower
W_USR 27 1997 1 0.0501 0.0501 0.4006 lower
S_FRH 27 1997 1 0.0501 0.0501 0.2504 lower
F_DER 25 2980 1 0.0336 0.0336 0.2349 lower
F_MRH 25 2980 1 0.0336 0.0336 0.2685 lower
W_USR 25 2980 1 0.0336 0.0336 0.2685 lower
F_MIL 24 2314 1 0.0432 0.0432 0.3025 lower
F_BUT 24 2314 1 0.0432 0.0432 0.2593 lower
W_USR 24 2314 1 0.0432 0.0432 0.4754 lower
F_COL 23 1756 1 0.0569 0.0569 0.5125 lower
F_DER 23 1756 1 0.0569 0.0569 0.2847 lower
F_MER 23 1756 1 0.0569 0.0569 0.3417 lower
S_BUT 23 1756 1 0.0569 0.0569 0.3986 lower
F_DER 22 3447 1 0.0290 0.0290 0.2031 lower
F_BUT 22 3447 1 0.0290 0.0290 0.2031 lower
F_NIM 22 3447 1 0.0290 0.0290 0.2901 lower
F_MKH 22 3447 1 0.0290 0.0290 0.2321 lower
F_MRH 22 3447 1 0.0290 0.0290 0.2031 lower
F_MER 22 3447 1 0.0290 0.0290 0.2031 lower
F_BUT 21 2937 1 0.0340 0.0340 0.2043 lower
F_MKH 21 2937 1 0.0340 0.0340 0.3745 lower
S_BUT 21 2937 1 0.0340 0.0340 0.3405 lower
F_BUT 20 3797 1 0.0263 0.0263 0.1580 lower
F_FRH 20 3797 1 0.0263 0.0263 0.2370 lower
F_MKH 20 3797 1 0.0263 0.0263 0.2634 lower
F_STN 20 3797 1 0.0263 0.0263 0.1844 lower
F_TOU 19 5071 1 0.0197 0.0197 0.1972 lower
F_MER 18 2958 1 0.0338 0.0338 0.3043 lower
F_MIL 17 2070 1 0.0483 0.0483 0.2415 lower
F_MER 17 2070 1 0.0483 0.0483 0.2899 lower
L_USR 17 2070 1 0.0483 0.0483 0.2415 lower
S_FRH 17 2070 1 0.0483 0.0483 0.1932 lower
F_STN 16 5093 1 0.0196 0.0196 0.1571 lower
S_FRH 16 5093 1 0.0196 0.0196 0.1178 lower
F_DER 15 3245 1 0.0308 0.0308 0.1849 lower
F_BUT 15 3245 1 0.0308 0.0308 0.1541 lower
S_FRH 13 5979 1 0.0167 0.0167 0.0669 lower
F_BUT 12 5620 1 0.0178 0.0178 0.1423 lower
S_FRH 12 5620 1 0.0178 0.0178 0.0890 lower
F_COL 11 3095 1 0.0323 0.0323 0.3554 lower
F_MIL 11 3095 1 0.0323 0.0323 0.2585 lower
F_DER 11 3095 1 0.0323 0.0323 0.1939 lower
F_MER 11 3095 1 0.0323 0.0323 0.2585 lower
L_USR 11 3095 1 0.0323 0.0323 0.2908 lower
W_USR 11 3095 1 0.0323 0.0323 0.2585 lower
S_MIL 11 3095 1 0.0323 0.0323 0.2908 lower
F_STN 10 5351 1 0.0187 0.0187 0.2056 lower
F_MER 10 5351 1 0.0187 0.0187 0.1869 lower
F_DER 7 5707 1 0.0175 0.0175 0.1752 lower
F_MRH 6 6408 1 0.0156 0.0156 0.1561 lower
F_MIL 5 6268 1 0.0160 0.0160 0.1755 lower
F_TOU 5 6268 1 0.0160 0.0160 0.1755 lower
F_MRH 5 6268 1 0.0160 0.0160 0.1755 lower
S_FRH 5 6268 1 0.0160 0.0160 0.0957 lower
F_MKH 4 5657 1 0.0177 0.0177 0.1944 lower
S_FRH 4 5657 1 0.0177 0.0177 0.0884 lower
F_DER 2 4869 1 0.0205 0.0205 0.1438 lower
L_USR 2 4869 1 0.0205 0.0205 0.2054 lower
S_MIL 2 4869 1 0.0205 0.0205 0.2054 lower

Table S20: Number of run/trib groups a chromosome has beenflagged as having significantly more/less loci with exclusivepolymorphisms than expected.

CHROMOSOME sign n
11 lower 7
32 lower 7
22 lower 6
31 lower 6
27 lower 5
29 lower 5
34 lower 5
5 lower 4
17 lower 4
20 lower 4
23 lower 4
28 lower 4
2 lower 3
21 lower 3
24 lower 3
25 lower 3
33 lower 3
4 lower 2
10 lower 2
12 lower 2
15 lower 2
16 lower 2
6 lower 1
7 lower 1
13 lower 1
18 lower 1
19 lower 1
30 higher 1
31 higher 1

Number of chromosomes with significantly more/less loci withexclusive polymorphisms than expected.

GRP sign n
F_DER lower 11
S_FRH lower 9
F_MIL lower 8
F_BUT lower 7
F_MKH lower 7
F_MER lower 7
W_USR lower 6
F_STN lower 5
F_TOU lower 5
L_USR lower 5
S_BUT lower 5
F_COL lower 4
F_NIM lower 4
F_MRH lower 4
S_MIL lower 2
F_DER higher 1
F_BUT higher 1
F_FRH lower 1
S_DER lower 1

Comparison of private alleles among runs

For this analysis we will exclude hatchery individuals and combine wild individuals from each run and the subset to 21 individuals (number of individuals in Late Fall run data set) to identify the number of alleles that occur only in individuals of a given run.

# unset the seed to get individually different reps
set.seed(NULL)

# list for results
priv_allele_counts <- list()

# number of individuals to subset
n <- 21

# wild populations
wild <- strata %>%
  filter(SOURCE == "WILD")

for(i in 1:100){

  # draw up to n individuals from each population
  sample_ind <- wild %>%
    group_by(RUN) %>%
    slice_sample(n = n)
  
  # subset genind object
  gen_subset <- gen[row.names(gen@tab) %in% sample_ind$LIB_ID]
  
  # identify private alleles
  alleles_private <- poppr::private_alleles(gen_subset,
                                            alleles ~ RUN,
                                            report = "data.frame")
  
  # count no private alleles per population
  priv_allele_counts[[i]] <- alleles_private %>%
    filter(count > 0) %>%
    count(population) %>%
    mutate(rep = i)
  
}

results <- ldply(priv_allele_counts, data.frame)

write_delim(results, "results/priv-alleles_wild_ind21_reps100.txt",
            delim = "\t")

Calculate the mean number of private alleles across all bootstraps.

Number of private alleles occurring in each set of individualsgrouped by run type within tributaries. All populations except for S_FRHwere subsampled down to 15 individuals.

population mean sd
Fall 861.22 33.61234
Late-Fall 807.24 21.58381
Winter 385.09 14.27090
Spring 1094.61 58.49543

3.12.1 Session Information

Package versions used for this analysis (R 4.0.5).

##             package loadedversion
## ade4           ade4        1.7-19
## adegenet   adegenet         2.1.7
## ape             ape         5.6-2
## assigner   assigner         0.5.8
## coin           coin         1.4-3
## dplyr         dplyr         1.0.9
## forcats     forcats         0.5.1
## ggplot2     ggplot2         3.3.6
## glue           glue         1.6.2
## hierfstat hierfstat        0.5-11
## httr           httr         1.4.3
## knitr         knitr          1.39
## lattice     lattice       0.20-45
## magrittr   magrittr         2.0.3
## pegas         pegas           1.1
## permute     permute         0.9-7
## plyr           plyr         1.8.7
## poppr         poppr         2.9.3
## purrr         purrr         0.3.4
## radiator   radiator         1.2.2
## readr         readr         2.1.2
## stringr     stringr         1.4.0
## survival   survival         3.3-1
## tibble       tibble         3.1.7
## tidyr         tidyr         1.2.0
## tidyverse tidyverse         1.3.1
## tint           tint         0.1.3
## tufte         tufte          0.12
## UpSetR       UpSetR         1.4.0
## vcfR           vcfR        1.15.0
## vegan         vegan         2.6-2
Gosselin, T, Eric C Anderson, and Ian R Bradbury. 2016. assigner: Assignment Analysis with GBS/RAD Data using R. R Package. https://doi.org/doi : 10.5281/zenodo.51453.
Hudson, Richard R. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation.” Bioinformatics 18 (2): 337–38. https://doi.org/10.1093/bioinformatics/18.2.337.
Jombart, Thibaut, Sébastien Devillard, and François Balloux. 2010. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations.” BMC Genetics 11 (1): 94. https://doi.org/10.1186/1471-2156-11-94.
Nei, Masatoshi. 1987. Molecular Evolutionary Genetics. https://doi.org/10.7312/nei-92038.
Watterson, G. A. 1975. On the number of segregating sites in genetical models without recombination.” Theoretical Population Biology 7 (2): 256–76. https://doi.org/10.1016/0040-5809(75)90020-9.
Weir, B. S., and C. Clark Cockerham. 1984. Estimating F-Statistics for the Analysis of Population Structure.” Evolution 38 (6): 1358. https://doi.org/10.2307/2408641.